Adversarial discriminative sim-to-real transfer of visuo-motor policies

Abstract

Various approaches have been proposed to learn visuo-motor policies for real-world robotic applications. One solution is first learning in simulation then transferring to the real world. In the transfer, most existing approaches need real-world images with labels. However, the labeling process is often expensive or even impractical in many robotic applications. In this article, we introduce an adversarial discriminative sim-to-real transfer approach to reduce the amount of labeled real data required. The effectiveness of the approach is demonstrated with modular networks in a table-top object-reaching task where a seven-degree-of-freedom arm is controlled in velocity mode to reach a blue cuboid in clutter through visual observations from a monocular RGB camera. The adversarial transfer approach reduced the labeled real data requirement by 50%. Policies can be transferred to real environments with only 93 labeled and 186 unlabeled real images. The transferred visuo-motor policies are robust to novel (not seen in training) objects in clutter and even a moving target, achieving a 97.8% success rate and 1.8 cm control accuracy. Datasets and code are openly available.

Keywords

Sim-to-real transfer, adversarial transfer learning, domain adaptation, visuo-motor policy learning, robotic reaching

1. Introduction

The advent of large datasets and sophisticated machine learning models, commonly referred to as deep learning, has in recent years created a trend away from hand-crafted solutions towards more data-driven ones. Learning techniques have shown significant improvements in robustness and performance since early work (Krizhevsky et al., 2012), particularly in the computer vision field.

Traditionally, robotic vision-based reaching approaches have been based on hand-crafted controllers that combine (heuristic) motion planners with hand-crafted features to localize the target visually. Recently, learning approaches to tackle this problem have been presented (Bateux et al., 2018; Katyal et al., 2017; Levine et al., 2016b; Sünderhauf et al., 2018; Zhang et al., 2015, 2017a,b); however, a consistent issue faced by most approaches is the reliance on large amounts of data to train these models. Generalization is another challenge: many current systems are brittle when learned models are applied to real robotic configurations or scenarios that differ from those used in training. This leads to the question: How can we better learn and transfer visuo-motor policies on robots for tasks such as reaching?

Various approaches have been proposed to address this problem. Some works tried to directly learn from large-scale real-world datasets (Levine et al., 2016b; Pinto and Gupta, 2016). However, collecting a large amount of real data could be expensive in robotic applications. For example, an “arm farm” with 6–14 physical robots was developed to collect data in parallel for learning robotic grasping (Levine et al., 2016b). Therefore, some methods were proposed to reduce the cost of collecting a large amount of real-world data by using simulated or synthetic data (Bateux et al., 2018; D'Innocente et al., 2017; James et al., 2017; Tobin et al., 2017). Some others tried to make use of both simulated and real data for a more balanced solution (Fitzgerald et al., 2015; Tzeng et al., 2016). A particular approach is modular deep Q-networks for learning a planar reaching task in simulation and then transferring to real environments with a small number of labeled real-world images (Zhang et al., 2017a,b).

In this work, we extend the modular approach (Zhang et al., 2017a) and focus on making use of both simulated and real data to learn robotic skills. In the modular deep Q-networks, labeled real images were previously used for the transfer. Although the amount of data was small, the labeling cost was non-trivial. In comparison, images themselves are cheap for a vision-based robotic system.

Aiming for more data-efficient learning, an adversarial approach similar to generative adversarial nets (GANs) (Goodfellow et al., 2014) was proposed to learn a classifier for grasping using labeled synthetic and unlabeled real data (Bousmalis et al., 2018). Similar approaches were also proposed for other classification tasks, such as adversarial discriminative domain adaptation (ADDA) for handwritten digit recognition (Tzeng et al., 2017) and incremental adversarial domain adaptation for drivable-path segmentation (Wulfmeier et al., 2018). Some works also tried to use adversarial approaches for regression transfer such as domain confusion with weak pairwise constraints for deep visuo-motor representation adaptation (Tzeng et al., 2016).

In this article, we leverage the ADDA idea (which achieved better performance in handwritten digit recognition than domain confusion (Tzeng et al., 2017)) and extend it from classification to regression transfer for learning visuo-motor policies from simulation to the real world. The approach is verified with modular networks in a visually guided table-top object reaching task for a seven-degree-of-freedom (7-DoF) robotic arm (Figure 1). By introducing an adversarial loss, visuo-motor policies can be successfully transferred from simulated (Figure 1(A)) to real (Figure 1(B)) environments with only 93 labeled and 186 unlabeled real images. Benefiting from the modular structure and weighted end-to-end fine-tuning, the learned visuo-motor policies achieved a reaching accuracy of 1.8 $cm$ with only 333 trajectories (30,225 state–velocity pairs) collected in simulation. The learned visuo-motor policies can reach a target object not only in clutter with distractor objects seen in training, but also in clutter with novel (not seen in training) distractor objects, and even when the target object is moving. In particular, this article has three major contributions.

Introduction of an adversarial discriminative approach by leveraging the ADDA in a semi-supervised manner for more data-efficient perception transfer from simulation to the real world, achieving a comparable accuracy (2.7 $cm$ ) with 50% fewer labeled real data and a slightly worse accuracy (3.0 $cm$ ) with 75% fewer labeled real data (compared with supervised adaptation: 2.8 $cm$ ).

Further verification of modular neural networks (Zhang et al., 2017a) for sim-to-real transfer of visuo-motor policies in a more realistic robotic reaching task: table-top object reaching in clutter using a 7-DoF arm in velocity mode, achieving a 97.8% success rate and 1.8 $cm$ accuracy.

Investigation of important factors in the adversarial discriminative transfer (ADT) approach with comprehensive comparison experiments and detailed analyses, showing their benefits and limits for future research.

Fig. 1.

A robot (Baxter) learns visuo-motor policies in simulation (A) to control its left arm (7 DoFs) to reach a target blue cuboid in clutter on a table. Baxter visually observes the table-top environments through the monocular RGB camera in its right hand. An adversarial discriminative approach (see Section 3) is used to transfer visuo-motor policies from simulation to the real world (B). The transfer is semi-supervised so needs very few labeled real images.

2. Related work

Data-driven learning approaches have become popular in computer vision and are starting to replace hand-crafted solutions in robotic applications (Sünderhauf et al., 2018). In particular, there has been growing interest in robotic vision tasks, i.e., robotic tasks based directly on real image data, such as object grasping and manipulation (Lenz et al., 2015; Levine et al., 2016b; Pinto and Gupta, 2016). An important factor in data-driven robot learning approaches is large-scale datasets, from either the real world or simulation.

2.1. Learning from real datasets

In the real world, collecting the datasets required for deep learning has been sped up by using many robots operating in parallel (Levine et al., 2016b). With over 800,000 grasp attempts recorded, a deep network was trained to predict the success probability of a sequence of motions aiming at grasping using a 7-DoF robotic manipulator with a two-finger gripper. Combined with a simple derivative-free optimization algorithm the grasping system achieved a success rate of 80%. Another example of dataset collection for grasping is the approach to self-supervised grasp learning in the real world where force sensors were used to autonomously label samples (Pinto and Gupta, 2016). After training with 50,000 real-world trials using a staged learning method, a deep convolutional neural network (CNN) achieved a grasping success rate around 70%.

The aforementioned results are impressive but were achieved at high cost in terms of money, space, and time (weeks to months). Aiming for more data-efficient learning, Levine et al. (2016a) introduced a CNN-based policy representation architecture with an added guided policy search (GPS) to learn visuo-motor policies (mapping joint angles and camera images to joint torques) for continuous control, which allows reduction in the number of real-world training examples by converting policy search into supervised learning with trajectory distributions as a bridge. Impressive results were achieved in complex tasks, such as hanging a coat hanger, inserting a block into a toy, and tightening a bottle cap.

2.2. Learning with simulation

Simulation is another resource to reduce the cost of collecting real-world datasets. With domain randomization, policies learned in simulation are robust enough to be directly used on real robots with real RGB cameras observing real scenes in manipulation tasks (James et al., 2017; Tobin et al., 2017). Recently, it has also been proposed to simulate depth images to learn and then directly transfer grasping skills to real-world robotic arms (Viereck et al., 2017).

There are also some negative results, which show that visuo-motor policies learned in a low-fidelity simulator do not transfer directly to real robots with real cameras observing real scenes (Zhang et al., 2015). In fact, very modest image distortions in the simulation environment (small translations, Gaussian noise, and scaling of the RGB color channels) caused the performance of the system to fall dramatically. Introducing a real camera observing the game screen was even worse (Tow et al., 2016). However, when adapted with a small number of real images, the visuo-motor policies learned in a low-fidelity simulator can be transferred well to real scenarios for a robotic planar reaching task (Zhang et al., 2017a,b).

2.3. Transfer learning

Transfer learning attempts to develop methods to transfer knowledge between different tasks (scenarios) (Pan and Yang, 2010; Taylor and Stone, 2009). Because collecting data in the real world is expensive, transferring skills from simulation to the real world is an attractive alternative. For the case of pre-training in simulation and then adapting with very few real-world samples, appropriate transfer learning approaches are required.

To reduce the number of real-world images required for learning visuo-motor policies, a method of adapting visual representations from simulated to real environments was proposed, achieving a success rate of 79.2% in a “hook loop” task, with 10 times fewer real-world images (Tzeng et al., 2016). Another example of vision-based policy transfer is progressive neural networks, which were proposed to improve transfer and avoid catastrophic forgetting when learning complex sequences of tasks (Rusu et al., 2016). Their effectiveness has been validated on reinforcement learning tasks such as Atari and 3D maze game playing as well as simulated robotic manipulation (Rusu et al., 2017).

Similar to GANs (Goodfellow et al., 2014), adversarial approaches were also proposed for domain adaptation in classification contexts such as handwritten digit recognition (Ganin et al., 2016; Ge et al., 2017; Luo et al., 2017; Tzeng et al., 2017), place classification and segmentation (Wulfmeier et al., 2017, 2018), object recognition (Tzeng et al., 2015), and fine-grained recognition (Gebru et al., 2017). An adversarial adaptation approach was also proposed to improve the efficiency of learning a classifier to determine whether a grasp command will be successful or not (Bousmalis et al., 2018). Some works also tried to use adversarial approaches for regression transfer such as a GAN for image style transfer from source to target domains (Bousmalis et al., 2017) and domain confusion for adapting deep visuo-motor representations (Tzeng et al., 2016). These approaches opened a new and attractive direction for more data-efficient learning.

This article mainly extends the ADDA approach (Tzeng et al., 2017) from classification to regression tasks to transfer learned visuo-motor policies from simulated to real-world settings. Unlike the domain confusion method, which leads to representations confusing a discriminator by aiming for a uniform distribution over domain labels (Tzeng et al., 2016, 2015), the ADDA approach encourages a target encoder to have a representation distribution as close as possible to the one from a source encoder, and achieved better performance in handwritten digit recognition (Tzeng et al., 2017).

3. Methodology

In our previous work (Zhang et al., 2017a), a modular structure and its training approach were proposed to transfer visuo-motor policies from simulation to the real world in a low-cost manner. The transfer was achieved by using 1,418 labeled real images to fine-tune a perception module pre-trained in simulation. In this article, we propose a semi-supervised transfer approach to reduce the number of labeled real images required. We call this semi-supervised approach ADT, which mainly benefits from the introduction of an adversarial loss (Tzeng et al., 2017).

3.1. Modular deep networks

Similar to the modular deep Q-network (Zhang et al., 2017a), a modular network architecture (Figure 2) is proposed, which consists of perception and control modules connected by a bottleneck layer. The bottleneck forces the network to learn a low-dimensional representation, not unlike auto-encoders (Hinton and Salakhutdinov, 2006). The difference is that we explicitly equate the bottleneck layer with the object position ( $x^{*} \in R^{3}$ , ignoring orientation).

Fig. 2.

The modular network consisting of perception and control modules, connected by a bottleneck layer representing target object position $x^{*}$ . The perception module architecture is customized from VGG16 (Simonyan and Zisserman, 2015) with its first convolutional layer initialized with weights from pre-trained VGG16. The control module consists of three fully connected layers and determines joint velocities according to target position and joint angles. The perception and control modules are first trained separately, then fine-tuned in an end-to-end fashion using weighted losses (Section 3.1.1).

With the bottleneck, the perception module learns how to estimate the object position $x^{*}$ from a raw-pixel image I; the control module learns to determine the most appropriate joint velocities $v$ given the object position $x^{*}$ and joint angles $q$ (defined as scene configuration $Θ = [x^{*}, q]$ ). The values of $x^{*}$ and $q$ are normalized to the interval $[0, 1]$ .

3.1.1. Training method

Perception The perception module is first pre-trained using labeled simulated data with a supervised loss $L_{p}^{Sup}$ . Then it is adapted using both simulated and real data with a compound loss

$$ L_{p} = L_{p}^{Sup} + L_{p}^{Ad} \tag{1} $$

where $L_{p}^{Ad}$ is an adversarial loss. Definitions of the loss functions are introduced in Section 3.2. We call this perception training approach ADT.

Control The control module is trained using supervised learning with only simulated data

$$ L_{c} = \frac{1}{2 m} \sum_{j = 1}^{m} ‖ y_{c} (s_{j}) - v_{j} ‖^{2} \tag{2} $$

where $y_{c} (s_{j})$ is the prediction of joint velocity $v_{j}$ for state $s_{j}$ , here $s = Θ$ (ground-truth); m is the number of samples.
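As an illustration, the loss in Equation (2) can be sketched in plain Python as follows. This is a minimal sketch rather than the implementation used in this work; the function name `half_squared_error` and the toy inputs are assumptions introduced here. The supervised perception loss in Equation (4) has the same form, with positions in place of velocities.

```python
def half_squared_error(predictions, targets):
    """Return (1 / 2m) * sum_j ||y_j - t_j||^2 over m samples."""
    m = len(predictions)
    total = 0.0
    for y, t in zip(predictions, targets):
        # Squared Euclidean norm of the per-sample error vector.
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total / (2 * m)


# Toy example with two samples of two-dimensional outputs (illustrative only).
toy_loss = half_squared_error([[0.1, 0.2], [0.0, -0.1]],
                              [[0.1, 0.0], [0.2, -0.1]])
```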

End-to-end fine-tuning using weighted losses To further improve hand–eye coordination, an end-to-end fine-tuning is conducted for the combined network (perception + control) after their separate training, using weighted control ( $L_{c}$ ) and perception ( $L_{p}$ ) losses. Note that $s = I$ (raw-pixel images) in Equation (2) for the end-to-end fine-tuning, rather than $Θ$ . The control module is updated using only $L_{c}$ , while the perception module is updated using the weighted loss

$$ L = β L_{p} + (1 - β) L_{c}^{BN} \tag{3} $$

where $L_{c}^{BN}$ is a pseudo-loss, which reflects the loss of $L_{c}$ in the bottleneck; $β \in [0, 1]$ is a balancing weight. From the backpropagation algorithm (LeCun, 1988), we can infer that $δ_{L} = β δ_{L_{p}} + (1 - β) δ_{L_{c}^{BN}}$ , where $δ_{L}$ is the gradient resulting from L; $δ_{L_{p}}$ and $δ_{L_{c}^{BN}}$ are the gradients resulting from $L_{p}$ and $L_{c}^{BN}$ , respectively (equivalent to that resulting from $L_{c}$ in the perception module).
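The gradient combination for the perception module can be sketched as below. This is an illustrative sketch only: the function name `perception_gradient` is hypothetical, and gradients are represented as flat Python lists rather than framework tensors.

```python
def perception_gradient(grad_lp, grad_lc_bn, beta):
    """Blend gradients as beta * dL_p + (1 - beta) * dL_c^BN, element-wise."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [beta * gp + (1.0 - beta) * gc
            for gp, gc in zip(grad_lp, grad_lc_bn)]


# With beta = 0.5 both losses contribute equally to the perception update.
mixed = perception_gradient([1.0, 2.0], [3.0, 0.0], beta=0.5)
```

Setting `beta` to 1 recovers a pure perception update, while setting it to 0 updates the perception module using only the control signal passed back through the bottleneck.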

3.2. ADT

ADT makes use of both adversarial and supervised losses to adapt a perception module with fewer labeled real images. In ADT, the perception module is divided into two parts: encoder and regressor. As shown in Figure 3, the encoder includes all the convolutional layers in a perception module; the regressor represents all the fully connected layers of the perception module.

Fig. 3.

In ADT, the perception module is divided into two parts: encoder and regressor. The encoder includes all the convolutional layers; the regressor represents all the remaining fully connected layers. We first pre-train a perception module (source encoder + source regressor) with $L_{p}^{Sup}$ using simulated images ( $I^{S}$ ) and their target object position labels ( ${x^{*}}^{S}$ ). The source encoder is then locked and used as a reference in the ADT to train a target encoder $E_{r}$ with $L_{p}^{Ad}$ using both simulated ( $I^{S}$ ) and real ( $I^{R}$ ) images without labels. In addition to the adversarial loss, $L_{p}^{Sup}$ is also used to train the target encoder and regressor with a small number of labeled real images ( $I^{R}$ and ${x^{*}}^{R}$ ). The target encoder and regressor are initialized with the weights in the source encoder and regressor. The discriminator consists of multiple fully connected layers.

A perception module (source encoder + source regressor) is first pre-trained with simulated images ( $I^{S}$ ) and their target object position labels ( ${x^{*}}^{S}$ ), using the supervised loss

$$ L_{p}^{Sup} = \frac{1}{2 m} \sum_{j = 1}^{m} ‖ y_{p} (I_{j}) - {x^{*}}_{j} ‖^{2} \tag{4} $$

where $y_{p} (I_{j})$ is the prediction of ${x^{*}}_{j}$ for $I_{j}$ . In the pre-training, $I = I^{S}$ and $x^{*} = {x^{*}}^{S}$ . Because $x^{*}$ has a physical meaning (the object position), labeled training data are convenient to collect.

The source encoder is then locked and used as a reference in the ADT to train a target encoder with both simulated ( $I^{S}$ ) and real ( $I^{R}$ ) images, but without labels, using an adversarial loss

$$ L_{p}^{Ad} = L_{D}^{Ad} + γ L_{E}^{Ad} \tag{5} $$

$$ L_{D}^{Ad} = - \frac{1}{2 m} \sum_{j = 1}^{m} [\log D (E_{s} (I_{j}^{S})) + \log (1 - D (E_{r} (I_{j}^{R})))] \tag{6} $$

$$ L_{E}^{Ad} = - \frac{1}{m} \sum_{j = 1}^{m} \log D (E_{r} (I_{j}^{R})) \tag{7} $$

where $γ$ is a balancing weight; D represents the discriminator; $E_{s}$ and $E_{r}$ are the source and target encoders in Figure 3. With $L_{D}^{Ad}$ , the discriminator (D) learns to distinguish which domain an encoded feature comes from: simulation or real world, i.e., $\arg \min_{D} L_{D}^{Ad}$ . Here $L_{E}^{Ad}$ leads the target encoder ( $E_{r}$ ) to be as similar as possible to the source encoder to confuse the discriminator, i.e., $\arg \min_{E_{r}} L_{E}^{Ad}$ .
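The adversarial losses can be sketched in plain Python as follows. This is a hedged illustration, not the implementation used in this work. It assumes that the discriminator output $D(\cdot)$ is the probability that a feature comes from simulation, and it writes the discriminator loss in cross-entropy (negated log-likelihood) form so that both losses are non-negative and are minimized; this sign convention and all function names are assumptions introduced here.

```python
import math


def discriminator_loss(d_source, d_target):
    """-(1/2m) * sum[log D(E_s(I^S)) + log(1 - D(E_r(I^R)))]."""
    m = len(d_source)
    total = sum(math.log(ds) + math.log(1.0 - dt)
                for ds, dt in zip(d_source, d_target))
    return -total / (2 * m)


def encoder_loss(d_target):
    """-(1/m) * sum log D(E_r(I^R)): small when real features look simulated."""
    m = len(d_target)
    return -sum(math.log(dt) for dt in d_target) / m


def adversarial_loss(d_source, d_target, gamma):
    """Compound adversarial loss L_D + gamma * L_E."""
    return discriminator_loss(d_source, d_target) + gamma * encoder_loss(d_target)


# A fully confused discriminator outputs 0.5 for every feature.
confused_ld = discriminator_loss([0.5, 0.5], [0.5, 0.5])
```

Under these assumptions a fully confused discriminator gives a discriminator loss of $\log 2$, and the loss shrinks as the discriminator separates the two domains more confidently.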

Experimental results in Section 5.2 show that a single adversarial loss ( $L_{p}^{Ad}$ ) is insufficient for the sim-to-real transfer of visuo-motor policies. Therefore, in addition to the adversarial loss, the supervised loss $L_{p}^{Sup}$ (Equation (4)) is also used in the transfer phase to train the target perception module (encoder and regressor) with a small number of labeled real images ( $I^{R}$ and ${x^{*}}^{R}$ ), i.e., $I = I^{R}$ , $x^{*} = {x^{*}}^{R}$ . The target perception module is initialized with the pre-trained weights from the source perception module.

Experimental results also show that maintaining $L_{D}^{Ad}$ in a certain range (0.26–0.30) helps improve the transfer performance (Section 5.3.3). Therefore, a PI controller is proposed to control $L_{D}^{Ad}$ to a desired value by changing the balancing weight $γ$ , as shown in Figure 4. In the ADT (Equation (5)), a larger $γ$ will result in a stronger effect of $L_{E}^{Ad}$ , which will then more strongly prevent $L_{D}^{Ad}$ from being smaller or even cause a larger $L_{D}^{Ad}$ . Similarly, a smaller $γ$ will help result in a smaller $L_{D}^{Ad}$ . The controller output u is mapped to the balancing weight $γ$ through a sigmoid function $γ = \frac{0.02}{1 + e^{- 50 u}}$ . The sigmoid function is selected empirically according to three major concerns:

$γ$ cannot be too large, in order to avoid catastrophic weight forgetting;

the value of $γ$ when $u = 0$ should be able to roughly guarantee an unchanged $L_{D}^{Ad}$ , providing a symmetric u-to-action-effect mapping;

$γ$ should not be zero, since the primary role of $L_{E}^{Ad}$ is to produce a good target encoder; although making $γ$ negative might help reduce $L_{D}^{Ad}$ further, it would work against that role.

Fig. 4.

A PI controller is used to control $L_{D}^{Ad}$ to a desired value (desired $L_{D}^{Ad}$ ). The controller output u is mapped to the balancing weight $γ$ through a sigmoid function. In the ADT process (Equation (5)), a larger $γ$ will result in a stronger effect of $L_{E}^{Ad}$ which will then more strongly prevent $L_{D}^{Ad}$ from being smaller or even cause a larger $L_{D}^{Ad}$ . Similarly, a smaller $γ$ will help result in a smaller $L_{D}^{Ad}$ .

Our tuned coefficients for proportional and integral gains are $K_{p} = 0.4$ and $K_{i} = 0.008$ , respectively. To solve the integral windup problem, pre-determined bounds are used to prevent the integral term from accumulating above 0.1 or below −0.1, i.e., $[- 0.1, 0.1]$ .
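The controller described above can be sketched as follows, using the gains, integral bounds, and sigmoid mapping stated in the text. It is a minimal sketch under stated assumptions rather than the implementation used in this work: the class name, the discrete-time accumulation of the integral term, the error sign (desired minus measured, so that a too-small $L_{D}^{Ad}$ raises $γ$), and the example set-point of 0.28 are all assumptions introduced here.

```python
import math


class GammaPIController:
    """PI controller that maps the L_D tracking error to the weight gamma."""

    def __init__(self, desired, kp=0.4, ki=0.008, integral_bound=0.1):
        self.desired = desired
        self.kp = kp
        self.ki = ki
        self.integral_bound = integral_bound
        self.integral = 0.0  # accumulated integral term (already scaled by ki)

    def update(self, measured_ld):
        # Assumed sign: L_D below the set-point gives a positive error,
        # which increases gamma and strengthens the encoder loss.
        error = self.desired - measured_ld
        self.integral += self.ki * error
        # Anti-windup: keep the integral term inside [-bound, bound].
        bound = self.integral_bound
        self.integral = max(-bound, min(bound, self.integral))
        u = self.kp * error + self.integral
        # Sigmoid mapping from controller output u to gamma in (0, 0.02).
        return 0.02 / (1.0 + math.exp(-50.0 * u))


controller = GammaPIController(desired=0.28)  # set-point chosen for illustration
gamma_at_target = controller.update(0.28)
```

With zero error and an empty integral term, $u = 0$ and the mapping returns the midpoint $γ = 0.01$.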

4. Benchmark: robotic reaching

We use a canonical target reaching task as a benchmark to evaluate the effectiveness of the proposed approach. The task is defined as controlling a robot arm so that its end-effector position $x \in R^{3}$ in operational space moves to the position of a target $x^{*} \in R^{3}$ (object position introduced in Section 3.1). The robot’s joint configuration is represented by its joint angles $q \in R^{n}$ . The two spaces are related by the forward kinematics, i.e., $x = K (q)$ . The reaching controller adjusts the robot configuration in velocity mode (i.e., controls joint velocities $v = \overset{\cdot}{q} \in R^{n}$ ) to minimize the error between the robot’s current and target position, i.e., $‖ x - x^{*} ‖$ . We consider a 7-DoF robotic arm (Figure 1), i.e., $q, v \in R^{7}$ , steering its end-effector position in three dimensions, ignoring orientation.

4.1. Task setup

The real-world task employs a Baxter robot’s left arm (7 DoFs) to reach a blue cuboid in clutter. All objects are arbitrarily placed in the operational area (50 $cm$ × 60 $cm$ ) on a table, as shown in Figure 5(A). The blue cuboid has a side length of 6.5 $cm$ . The robot observes environments through a monocular RGB camera in its right hand (Figure 1(A)), providing RGB images with a resolution of 256×256 (cropped from 640×400 images). The left arm is controlled in velocity mode. A reach is deemed successful if the Euclidean distance between the top center of the target cuboid and the bottom center of the suction gripper (“Top Center” and “Bottom Center” in Figure 5) is smaller than 4.6 $cm$ (half of the diagonal length of any side of the cuboid). In the task, the left arm is randomly initialized to a configuration with a normal distribution around the reference configuration shown in Figure 5(B). The right arm is set to a constant pose, i.e., the camera pose is constant with possible minor errors (Baxter joint accuracy: ±0.10°) in the real world.

Fig. 5.

A Baxter robot controls its left arm in velocity mode to reach a blue cuboid ( $6.5 cm \times 6.5 cm \times 6.5 cm$ ) in clutter arbitrarily placed in the operational area. The “Top Center” and “Bottom Center” are the top center of the target cuboid and the bottom center of the suction gripper. Figure 5(B) shows the left arm in its reference initial configuration.

4.2. Network architecture

In this work, we used a network with the architecture shown in Figure 2. The perception module has an architecture customized from VGG16 (Simonyan and Zisserman, 2015). The customization mainly includes reducing the number of convolutional layers in each group (between two max pooling layers) and changing the number of feature maps in each convolutional layer for lower computational cost but without losing performance for the benchmark task. It consists of 12 convolutional layers with 3×3 filters and seven 2×2 max pooling layers, followed by three fully connected layers. The twelve convolutional layers and two hidden fully connected layers use rectified linear unit (ReLU) activation. Simulated or real RGB images are cropped and down-sampled to $256 \times 256$ as inputs to the perception module. The pixel values in images were normalized to $[- 1, 1]$ . The first convolutional layer is initialized with pre-trained weights for ILSVRC-2014 (Simonyan and Zisserman, 2015) (observed to converge faster and achieve better performance than random initialization); other layers are randomly initialized.

The control module consists of 3 fully connected layers, with 400 and 300 units in the 2 hidden layers (with ReLU activation), respectively. The input to the control module is the scene configuration $Θ$ (target position and joint angles); its outputs are the estimated joint velocities $v$ . This module is initialized with random weights.

The discriminator network consists of 3 fully connected layers with 256 units in each of its 2 hidden layers (also with ReLU activation). Input to the discriminator is an encoded feature vector with a dimension of 256, either from the source encoder or target encoder. The output layer has two units (two classes: simulated or real) with softmax activation. The discriminator is also randomly initialized.

4.3. Datasets collection

Perception datasets contain a number of image–position (I– $x^{*}$ ) pairs. In this work, we label the position of the target cuboid top center as the target position $x^{*}$ rather than its center of mass. Figure 6 shows some samples of the collected simulated and real images for the benchmark task. The simulated data was collected using V-REP (Rohmer et al., 2013) (a robotic simulation platform) through domain randomization (Tobin et al., 2017) in the following aspects:

number of distractor objects in clutter, random in $[0, 9]$ ;

shape of distractor objects in clutter, random in nine primitive shapes with different geometries (five cuboids, two spheres, two cylinders);

pose of distractor objects, random position in the operational area and random orientation about the vertical axis;

color of distractor objects, random RGB values;

left arm configuration, random in joint space, excluding those with self-collision;

color of the table, floor, robot body, and target cuboid, random changes based on reference colors ( $\pm 10\%$ );

camera pose, random changes of the right arm joint configuration relative to reference angles ( $\pm 1\%$ );

camera field of view (FoV), random changes based on a reference FoV ( $\pm 2\%$ );

table pose, random changes based on a reference position ( $[\pm 1.5\%, \pm 5\%, \pm 1\%]$ ) and a reference orientation about the vertical axis ( $\pm 7\%$ ).

Fig. 6.

Simulated and real images for training perception modules. Simulated images were collected from a V-REP simulator using domain randomization (Tobin et al., 2017). Real images were collected for perception adaptation on a real Baxter (Figure 1(B)).

All the above randomization is distributed uniformly. The reference colors, FoV, and table pose were tuned manually to approximate the real scene. The reference joint angles of the right arm (i.e., camera pose) were tuned in the real world, making sure the in-hand camera can see the entire operational area. The parameters for the randomized factors based on references were manually tuned to simulate possible variations in the real scene.
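A subset of the randomization listed above can be sketched as uniform sampling in plain Python. This is an illustrative sketch only and does not reproduce the V-REP scene setup: the reference FoV of 60 degrees is a hypothetical placeholder, the percentage ranges are interpreted as relative changes of the reference value (an assumption), and the names `sample_scene` and `jitter` are introduced here.

```python
import math
import random

# Nine distinct primitive shapes: five cuboids, two spheres, two cylinders.
SHAPES = ([f"cuboid_{i}" for i in range(5)]
          + [f"sphere_{i}" for i in range(2)]
          + [f"cylinder_{i}" for i in range(2)])


def jitter(reference, fraction, rng):
    """Uniformly perturb a reference value by up to +/- fraction of itself."""
    return reference * (1.0 + rng.uniform(-fraction, fraction))


def sample_scene(rng, ref_fov_deg=60.0):
    """Draw one randomized scene; ref_fov_deg is a hypothetical reference."""
    distractors = []
    for _ in range(rng.randint(0, 9)):  # number of distractors in [0, 9]
        distractors.append({
            "shape": rng.choice(SHAPES),
            # Position inside the 50 cm x 60 cm operational area (metres).
            "xy": (rng.uniform(0.0, 0.5), rng.uniform(0.0, 0.6)),
            # Orientation about the vertical axis.
            "yaw": rng.uniform(0.0, 2.0 * math.pi),
            "rgb": (rng.random(), rng.random(), rng.random()),
        })
    return {"distractors": distractors,
            "fov_deg": jitter(ref_fov_deg, 0.02, rng)}


scene = sample_scene(random.Random(0))
```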

The real images shown in Figure 6 were collected in the real world on a Baxter robot (Figure 1(B)) with random objects and left arm configurations. There are 11 real distractor objects in total. The ground-truth position of the target blue cuboid was collected by putting the end-effector bottom center on the cuboid top center and recording the left arm configuration (target configuration $q^{*}$ ) for forward kinematics, i.e., $x^{*} = K (q^{*})$ . The ground-truth position collected in this way is accurate enough for the benchmark task, although some errors might be caused by manually matching the end-effector with the cuboid. This ground-truth position collection method was also used in the control performance evaluation in Section 5.

More formally, we use $Z_{Sup}^{S} (N) = \{I_{i}^{S}, {x^{*}}_{i}^{S}\}_{i = 0}^{N}$ to represent a perception dataset of N labeled simulated images. Similarly, $Z_{Sup}^{R} (N) = \{I_{i}^{R}, {x^{*}}_{i}^{R}\}_{i = 0}^{N}$ represents a perception dataset of N labeled real images. Apart from the labeled real images, we also collected real images without labels for the ADT, represented as $Z_{Ad}^{R} (N) = \{I_{i}^{R}\}_{i = 0}^{N}$ .

In training, to increase the training data diversity, data augmentation is done on the fly for both simulated and real images by varying image brightness ( $\pm 80\%$ for simulated images and $\pm 40\%$ for real images) and white balance ( $\pm 2.5\%$ ) in a post-processing manner. These augmentation parameters were determined empirically.
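The on-the-fly augmentation can be sketched as below. This is a hedged illustration, not the augmentation code used in this work: it assumes that brightness and white balance are applied as multiplicative gains (one global brightness factor and one gain per RGB channel) on pixel values in $[0, 1]$, and the function name `augment` is hypothetical.

```python
import random


def augment(pixels, brightness, white_balance=0.025, rng=random):
    """Scale brightness and per-channel gains of RGB pixels in [0, 1]."""
    # One global brightness factor per image, drawn uniformly.
    b = 1.0 + rng.uniform(-brightness, brightness)
    # One white-balance gain per colour channel.
    gains = [1.0 + rng.uniform(-white_balance, white_balance) for _ in range(3)]
    return [tuple(min(1.0, max(0.0, c * b * g)) for c, g in zip(px, gains))
            for px in pixels]


gray = [(0.5, 0.5, 0.5)]
real_aug = augment(gray, brightness=0.4, rng=random.Random(1))  # real images
sim_aug = augment(gray, brightness=0.8, rng=random.Random(1))   # simulated images
```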

Control datasets contain a number of scene configuration–velocity ( $Θ$ – $v$ ) pairs (i.e., trajectories) as well as image–velocity (I– $v$ ) and image–position (I– $x^{*}$ ) pairs. The $Θ$ – $v$ pairs are for training control modules separately (Section 3.1.1); the I– $v$ and I– $x^{*}$ pairs are for end-to-end fine-tuning to obtain $δ_{L_{c}}$ and $δ_{L_{p}}$ (Section 3.1.1).

Control datasets were collected purely in simulation using V-REP, represented as $Z_{c}^{S} (N_{T}; N) = \{I_{i}^{S}, {x^{*}}_{i}^{S}, Θ_{i}^{S}, v_{i}^{S}\}_{i = 0}^{N}$ where $N_{T}$ indicates the number of trajectories in a dataset and N is the number of samples (frames in trajectories). In dataset collection, trajectories were generated to control the left arm with a random initial configuration (excluding those with self-collision) to reach a target arbitrarily placed in the operational area, without considering obstacle avoidance. As introduced in Section 4.1, the random initial configuration has a normal distribution around the reference configuration shown in Figure 5(B); the random targets are distributed uniformly in the operational area. When generating the trajectories, the pseudo-inverse method (V-REP internal implementation) was used to calculate the desired arm configuration to reach a target, i.e., $q^{*} = K^{- 1} (x^{*})$ . Then a simple proportional controller was used to control the left arm to reach the desired configuration from its initial configuration with a control frequency of 20 $Hz$ . In the process, the target cuboid position, joint angles, and velocity commands were recorded, along with synthetic images from the camera in the right hand. Experiments (Section 5.4) show that simulated control training data alone is sufficient to achieve good real-world performance: there is no need to collect real control datasets.
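The trajectory generation step can be sketched as a proportional controller in joint space running at 20 Hz. This is a minimal sketch under stated assumptions, not the V-REP pipeline used in this work: the gain, stopping tolerance, and step limit are hypothetical, the arm is modelled as an ideal integrator of the commanded velocities, and the desired configuration `q_target` is taken as given rather than computed by inverse kinematics.

```python
def proportional_reach(q_init, q_target, gain=1.0, rate_hz=20.0,
                       tol=1e-3, max_steps=1000):
    """Record (joint angles, joint velocities) pairs along a reach."""
    dt = 1.0 / rate_hz
    q = list(q_init)
    trajectory = []
    for _ in range(max_steps):
        # Proportional law: velocity command proportional to joint error.
        v = [gain * (qt - qi) for qt, qi in zip(q_target, q)]
        trajectory.append((list(q), v))
        if max(abs(qt - qi) for qt, qi in zip(q_target, q)) < tol:
            break
        # Ideal integrator stands in for the simulated arm dynamics.
        q = [qi + vi * dt for qi, vi in zip(q, v)]
    return trajectory


# Toy 7-DoF example in radians (values are illustrative only).
traj = proportional_reach([0.0] * 7, [1.0] * 7)
```

Each recorded pair corresponds to one joint-angle and velocity sample; in the actual datasets the target position and a synthetic image are stored alongside every such sample.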

For comparison experiments in Section 5, we collected 11 perception (3 labeled simulated, 4 labeled real, and 4 unlabeled real) and 3 control datasets, as listed in Table 1. The datasets and code are available at https://github.com/Fanleyrobot/ADT.

Table 1.

Collected datasets.

Simulated perception datasets	$Z_{Sup}^{S} (340)$ , $Z_{Sup}^{S} (750)$ , $Z_{Sup}^{S} (3, 000)$
Real perception datasets with labels	$Z_{Sup}^{R} (48)$ , $Z_{Sup}^{R} (93)$ , $Z_{Sup}^{R} (186)$ , $Z_{Sup}^{R} (279)$
Real perception datasets without labels	$Z_{Ad}^{R} (48)$ , $Z_{Ad}^{R} (93)$ , $Z_{Ad}^{R} (186)$ , $Z_{Ad}^{R} (279)$
Control datasets	$Z_{c}^{S} (118; 10, 677)$ , $Z_{c}^{S} (333; 30, 225)$ , $Z_{c}^{S} (2, 964; 269, 851)$

5. Experiments and results

We first evaluated the performance of supervised perception adaptation as a baseline. The performance of the proposed approach was then evaluated in three aspects: adversarial discriminative perception adaptation performance, control module performance, and hand–eye coordination. The important factors in ADT were also investigated with detailed comparison experiments. All the evaluations were conducted in the real world using the following metrics.

Perception error: the Euclidean distance between the estimated and ground-truth object positions.

Control error: the Euclidean distance between the target cuboid top center and end-effector bottom center (“Top Center” and “Bottom Center” in Figure 5(A)).

Success rate: the percentage of successful reaches among all trials, where a reach is deemed successful if the final Euclidean distance between the target and end-effector (after the robot stops or times out) is smaller than 4.6 $cm$ as defined in Section 4.1.
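The three metrics can be written down compactly. The following is a small sketch with hypothetical function names; positions are assumed to be coordinate tuples in centimetres.

```python
import math

SUCCESS_THRESHOLD_CM = 4.6  # threshold stated in the text (Section 4.1)


def euclidean_error(estimate, reference):
    """Perception or control error: Euclidean distance between two positions."""
    return math.dist(estimate, reference)


def success_rate(final_errors_cm, threshold_cm=SUCCESS_THRESHOLD_CM):
    """Percentage of trials whose final distance is strictly below the threshold."""
    successes = sum(1 for e in final_errors_cm if e < threshold_cm)
    return 100.0 * successes / len(final_errors_cm)
```

For example, a trial that ends exactly 4.6 cm from the target does not count as a success, because the criterion is "smaller than".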

5.1. Supervised perception adaptation

Supervised adaptation is a commonly used approach in deep learning for knowledge transfer between different domains. Here, we used its performance as a baseline to compare with the proposed ADT approach. To investigate the influence of the numbers of simulated and real images on adapted perception accuracy, we evaluated 15 different perception modules. They were trained with different combinations of labeled images:

the number of labeled simulated images is from 0 to 3,000 (i.e., $Z_{Sup}^{S} (340)$ , $Z_{Sup}^{S} (750)$ , and $Z_{Sup}^{S} (3, 000)$ );

the number of labeled real images is from 0 to 279 (i.e., $Z_{Sup}^{R} (93)$ , $Z_{Sup}^{R} (186)$ , and $Z_{Sup}^{R} (279)$ ).

As introduced in Section 3.1.1, all 15 perception modules were first trained using simulated images then adapted with real images, but only using the supervised loss $L_{p}^{Sup}$ without the adversarial loss. The training was from scratch, except that the first convolutional layer was initialized with weights from pre-trained VGG16 (Simonyan and Zisserman, 2015). During training, we used a mini-batch size of 32 with a learning rate of 0.01. RmsProp (Tieleman and Hinton, 2012) was adopted; the same training method was used in the experiments for ADT (Section 5.2), control modules (Section 5.4), and end-to-end fine-tuning (Section 5.5). The median and third quartile (Q3) of their perception errors for a test set are shown in Figure 7. The test set has 144 real images where the target is distributed uniformly in the operational area, with random distractor objects (the 11 distractor objects that appeared in training) and left arm configurations. The test set was collected with the same setup as the training set, but its samples are different from those used for training.

Fig. 7.

Object position estimation error map for supervised adaptation. The numbers in the map show the median and third quartile (Q3) of the Euclidean distances between predicted and ground-truth positions. “N/A” means no result for that case.

From Figure 7, we can see that the perception modules trained with only simulated (the left-most column) or real images (the bottom row) have very large errors. For the modules trained with both simulated and real images, increasing the number of either simulated or real images helped reduce the error. Fine-tuning (adaptation) with as few as 93 real images can make a perception module work in the real world with a median error of 3.9 $cm$ . The module trained with 3,000 simulated and 279 real images (top-right) achieved the smallest median error (2.4 $cm$ ). However, trading off the accuracy and the used number of real images, the module trained with 3,000 simulated and 186 real images is the most balanced, labeled as $P_{s 1}$ . It has a median error of 2.8 $cm$ , which is 17% larger than the best, but needs only 67% of the real images. The module $P_{s 0}$ , which was trained with 3,000 simulated samples, was used as a source perception module in the evaluation of the proposed ADT approach (Section 5.2).

To study how much the on-the-fly data-augmentation method (Section 4.3) can help improve the perception accuracy, we trained a perception module using 3,000 simulated and 186 real images without data augmentation. It achieved a median error of 3.1 $cm$ (Q3: 4.4 $cm$ ), which is 11% larger than $P_{s 1}$ . This shows that the data augmentation did help improve the perception accuracy.

5.2. ADT

In this section, we evaluated the perception modules trained by the proposed ADT approach using the same test set. 16 modules were trained using ADT to investigate how the amount of labeled and unlabeled real images influences the adaptation performance. They were adapted with different combinations of real images:

the number of labeled real images from 0 to 186 (i.e., $Z_{Sup}^{R} (48)$ , $Z_{Sup}^{R} (93)$ , and $Z_{Sup}^{R} (186)$ );

the number of unlabeled real images from 0 to 279 (i.e., $Z_{Ad}^{R} (48)$ , $Z_{Ad}^{R} (93)$ , $Z_{Ad}^{R} (186)$ , and $Z_{Ad}^{R} (279)$ ).

All 16 perception modules were adapted using the adversarial loss (Equation (5)) from the same module $P_{s 0}$ , which was pre-trained with 3,000 simulated images in Section 5.1 (equivalent to the pre-training phase of ADT). The target encoders and regressors of the 16 perception modules were initialized with the weights of $P_{s 0}$ . The encoder part of $P_{s 0}$ also worked as the reference source encoder in the ADT.

In the transfer phase, we used a constant learning rate of 0.001 and a mini-batch size of 32. In particular, 32 simulated (from $Z_{Sup}^{S} (3000)$ ) and 32 unlabeled real images (from $Z_{Ad}^{R} (N)$ ) were used to calculate $L_{D}^{Ad}$ in each transfer step; and the same 32 unlabeled real images were also used to calculate $L_{E}^{Ad}$ ; then another 32 labeled real images (from $Z_{Sup}^{R} (N)$ ) were used to calculate $L_{p}^{Sup}$ . The desired discriminative loss $L_{D}^{Ad}$ was set to 0.28. The other hyper-parameters are the same as in Section 5.1.
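The mini-batch composition of one transfer step can be sketched as below. The function and key names are hypothetical, and sampling without replacement within a batch is an assumption; the loss computations themselves (Equations (5) and (7)) are not reproduced here.

```python
import random

BATCH = 32  # mini-batch size stated in the text


def sample_transfer_batches(sim_labeled, real_unlabeled, real_labeled, rng=random):
    """Return the sample groups that feed each loss in one transfer step."""
    sim = rng.sample(sim_labeled, BATCH)
    unlabeled = rng.sample(real_unlabeled, BATCH)
    labeled = rng.sample(real_labeled, BATCH)
    return {
        "L_D_Ad": (sim, unlabeled),  # discriminator loss: simulated vs. unlabeled real
        "L_E_Ad": unlabeled,         # encoder loss: the same unlabeled real images
        "L_p_Sup": labeled,          # supervised loss: another 32 labeled real images
    }
```

The key point is that the unlabeled real batch is shared between $L_{D}^{Ad}$ and $L_{E}^{Ad}$ , while the labeled real batch is drawn separately.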

Figure 8 shows the performance of the perception modules adapted with different numbers of unlabeled and labeled real images. The bottom row shows the results for the modules adapted without unlabeled real images, i.e., supervised adaptation (three of them have appeared in the top row of Figure 7, except the one adapted with 48 labeled real images). The results for the cases without labeled real images (i.e., unsupervised adaptation, $L_{p} = L_{p}^{Ad}$ ) are shown in the left-most column, from which we can observe that modules adapted with more unlabeled real images have smaller errors, but the improvement is marginal beyond 186 images. The poor accuracy ( $\geq 16.7$ $cm$ ) of perception modules adapted without labeled real data indicates that the adversarial loss ( $L_{p}^{Ad}$ ) alone is insufficient for the sim-to-real transfer of visuo-motor policies.

Fig. 8.

Object position estimation error map for the ADT approach. The x-axis shows the number of labeled real images used; the y-axis shows the number of unlabeled real images. Note: the bottom row shows the cases without unlabeled real images (i.e., supervised adaptation, effectively the top row in Figure 7); but different from Figure 7, the cases with 48 rather than 279 labeled real images were evaluated here to better observe how the ADT approach would work with fewer annotated real images.

The other results are for the cases with both labeled and unlabeled real images (i.e., semi-supervised adaptation). We can see that the modules adapted with more labeled images have smaller errors, but the improvement is not obvious beyond 93 labeled samples. Similarly, more unlabeled real images also resulted in smaller errors. However, performance became worse if the number of unlabeled images was more than twice the number of labeled samples (e.g., the modules adapted with 48 labeled and more than 93 unlabeled real images, as well as the module adapted with 93 labeled and 279 unlabeled samples) or fewer than half the number (e.g., the module adapted with 186 labeled and 48 unlabeled real images). This might be because a large difference between the amounts of unlabeled and labeled data makes their distributions differ considerably, which then results in worse adaptation. More investigation is necessary in the future to make better use of unlabeled real data, enabling performance improvement for the cases with more than twice as many unlabeled as labeled images.
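The ratio observation above can be expressed as a simple check. This is a rule of thumb drawn from these particular experiments, not a general guarantee; the function name and the inclusive treatment of the 0.5 and 2.0 boundaries are assumptions.

```python
def unlabeled_ratio_ok(n_labeled, n_unlabeled, low=0.5, high=2.0):
    """True if the unlabeled/labeled ratio lies in the range that worked here.

    Intended for the semi-supervised cases only (both counts positive).
    """
    if n_labeled <= 0:
        return False  # the purely unsupervised case performed poorly
    ratio = n_unlabeled / n_labeled
    return low <= ratio <= high
```

Under this check, 93 labeled with 186 unlabeled images passes, whereas 48 labeled with 186 unlabeled or 186 labeled with 48 unlabeled does not.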

The best performance was achieved by the modules adapted with 186 labeled and 186 or 279 unlabeled real images. However, trading off the accuracy and the number of labeled real images (expensive), the module adapted with 93 labeled and 186 unlabeled real images is the best one, labeled as PP. It has a slightly larger error than the best, but needs 50% fewer labeled real images.

By comparing the bottom row (supervised adaptation) with the other rows (ADT), we can see that the benefit of the adversarial loss was significant, particularly for the cases with very few labeled samples, e.g., the modules adapted with 48 labeled real images (more than 85% improvement, perception errors reduced from 28.0 $cm$ to less than 4.2 $cm$ ). In contrast, the benefit of the adversarial loss was trivial when adapting with 186 labeled samples.

5.3. Important factors in ADT

To further investigate the effectiveness and robustness of the proposed ADT approach, we conducted some comparison experiments in four different aspects.

How robust is ADT to different random seeds in training?

How effective is the PI controller?

How does the desired discriminative loss for the PI controller affect the adaptation performance?

How does the capacity of a discriminator network affect the adaptation performance?

In these comparison experiments, all perception modules were trained using the same conditions for PP, i.e., 3,000 labeled simulated images ( $Z_{Sup}^{S} (3, 000)$ ), 93 labeled ( $Z_{Sup}^{R} (93)$ ) and 186 unlabeled ( $Z_{Ad}^{R} (186)$ ) real images. The training hyper-parameters other than the comparing one were the same as for PP. Performances were evaluated using the same test set that was used in Sections 5.1 and 5.2.

5.3.1. Robustness to different random seeds

To evaluate the robustness of the proposed ADT approach, five perception modules were trained using different random seeds, i.e., seeds 1 to 5. Seed 1 is the one used for PP. Figure 9 shows their estimation errors in the box-plot form, from which we can see that their median errors were between 2.6 and 2.8 $cm$ , with slightly different distributions. These results indicate that the ADT approach is robust to different random seeds in training.

Fig. 9.

Box-plots of the Euclidean distances between predicted and ground-truth positions for perception modules adapted using the ADT approach with different random seeds. The crosses represent outliers, the numbers show the medians. The outliers are those ≥Q3+w(Q3-Q1) or ≤Q1-w(Q3-Q1), where Q1 and Q3 are the first and third quartiles; $w = 1.5$ .
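The outlier rule in the caption can be sketched as follows. The quartile interpolation method is not stated in the text, so the "inclusive" method of Python's `statistics.quantiles` is an assumption, and the function names are hypothetical.

```python
import statistics


def outlier_fences(errors, w=1.5):
    """Return (lower, upper) fences Q1 - w*(Q3 - Q1) and Q3 + w*(Q3 - Q1)."""
    q1, _, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - w * iqr, q3 + w * iqr


def split_outliers(errors, w=1.5):
    """Separate errors into inliers and outliers using the fences above."""
    low, high = outlier_fences(errors, w)
    inliers = [e for e in errors if low < e < high]
    outliers = [e for e in errors if e <= low or e >= high]
    return inliers, outliers
```

With $w = 1.5$ this is the conventional box-plot whisker rule; different quartile conventions can shift the fences slightly for small samples.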

5.3.2. Effectiveness of the PI controller

To see the benefit of the PI controller, a module was adapted using the adversarial discriminative loss but without the PI controller, i.e., $γ = 1$ in Equation (5). In addition, we also compared the adversarial discriminative loss to another form of adversarial loss: confusion loss (Tzeng et al., 2015), where Equation (7) was replaced by

\begin{matrix} L_{E}^{Ad} = - \frac{1}{2 m} \sum_{j = 1}^{m} [\frac{1}{2} \log D (E_{s} (I_{j}^{S})) + \frac{1}{2} \log (1 - D (E_{s} (I_{j}^{S}))) + \frac{1}{2} \log D (E_{r} (I_{j}^{R})) + \frac{1}{2} \log (1 - D (E_{r} (I_{j}^{R})))] \end{matrix}

(8)

of which the weights in source and target encoders were shared, i.e., $E_{s} = E_{r}$ . Comparison was also made with a module adapted without any adversarial loss.
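Equation (8) can be evaluated directly from discriminator outputs. The sketch below is a plain transcription of the formula; the function name is hypothetical and the inputs are assumed to be probabilities strictly between 0 and 1.

```python
import math


def confusion_loss(d_sim, d_real):
    """Equation (8) evaluated on discriminator outputs in (0, 1).

    d_sim[j] = D(E_s(I_j^S)) and d_real[j] = D(E_r(I_j^R)); both have length m.
    """
    m = len(d_sim)
    total = 0.0
    for ds, dr in zip(d_sim, d_real):
        total += 0.5 * math.log(ds) + 0.5 * math.log(1.0 - ds)
        total += 0.5 * math.log(dr) + 0.5 * math.log(1.0 - dr)
    return -total / (2.0 * m)
```

The loss is smallest when the discriminator outputs 0.5 for every sample, i.e., when it cannot tell the two domains apart; in that case the value is $\log 2$ .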

Figure 10 compares the results, from which we can see that the domain confusion approach (DCT) has much larger errors than ADT, either with or without the PI controller. The approaches with the PI controller achieved better performance than those without. In particular, ADT has a 13% smaller median perception error than ADT without PI; DCT’s median error is 35% smaller than that of DCT without PI. These results show that the PI controller did help improve the adaptation, regardless of whether the adversarial discriminative loss or the domain confusion loss was used.

Fig. 10.

Box-plots for perception modules adapted using different adversarial losses with/without the PI controller. DCT denotes a perception module adapted using the domain confusion approach where Equation (7) was replaced by Equation (8). Without ADT represents a perception module adapted without any adversarial loss, i.e., supervised adaptation (second from the left in the top row of Figure 7).

From the comparison with the approach adapted without any adversarial loss (Without ADT), we can see that DCT has even larger errors than Without ADT, while ADT’s errors are smaller. This indicates that the adversarial discriminative loss worked better than the domain confusion loss for our case. Note that the results of the DCT approaches in Figure 10 are those under the same condition as the ADT ones for fair comparison; broader qualitative experiments indicate that the domain confusion loss can bring some performance improvement (slightly better than supervised adaptation), but it was less significant than the adversarial discriminative loss.

5.3.3. Appropriate desired $L_{D}^{Ad}$ for the PI controller

To investigate how the desired discriminative loss $L_{D}^{Ad}$ for the PI controller affects the adaptation performance, eight perception modules were adapted with different desired $L_{D}^{Ad}$ . Their estimation errors are shown in Figure 11. We can see that the modules with desired losses between 0.26 and 0.30 have very similar performance. The others outside the interval have larger errors: the further the desired loss lies below or above this interval, the larger the perception error. This shows that too large or too small a desired $L_{D}^{Ad}$ could cause worse adaptation, while setting the desired loss within a certain range (0.26–0.30) helps achieve good perception adaptation.
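For readers unfamiliar with PI control, a generic discrete PI controller tracking a desired loss value can be sketched as below. This is not the authors' controller: the class name, the gains `kp` and `ki`, and the sign convention are hypothetical, and how the controller output enters Equation (5) is not restated here.

```python
class PIController:
    """Generic discrete proportional-integral controller (illustrative only)."""

    def __init__(self, setpoint, kp, ki):
        self.setpoint = setpoint  # e.g. a desired discriminative loss such as 0.28
        self.kp = kp              # proportional gain (hypothetical value)
        self.ki = ki              # integral gain (hypothetical value)
        self.integral = 0.0

    def update(self, measured):
        """Return the control signal for the current measured loss."""
        error = self.setpoint - measured
        self.integral += error
        return self.kp * error + self.ki * self.integral
```

The integral term keeps accumulating while the measured loss stays away from the setpoint, which is what lets such a controller remove a steady offset that a purely proportional term would leave.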

Fig. 11.

Box-plots for perception modules adapted using the ADT approach with different desired discriminative losses for the PI controller.

5.3.4. Appropriate discriminator networks

To study how discriminator network architecture could affect the adaptation performance, we adapted eight perception modules with different discriminator networks as follows (numbers in brackets represent the number of units in each layer):

Net 1, 2 hidden layers, $(32, 32)$ ;

Net 2, 2 hidden layers, $(64, 64)$ ;

Net 3, 2 hidden layers, $(128, 128)$ ;

Net 4, 2 hidden layers, $(256, 256)$ , the one used in the other experiments;

Net 5, 2 hidden layers, $(512, 512)$ ;

Net 6, 3 hidden layers, $(256, 256, 256)$ ;

Net 7, 4 hidden layers, $(256, 256, 256, 256)$ ;

Net 8, 5 hidden layers, $(256, 256, 256, 256, 256)$ .
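The eight configurations above can be collected in a small table, together with a helper that counts trainable parameters to compare capacities. The sketch assumes fully connected layers with biases and a single output unit; the input dimension is not given in this section, so `input_dim` is a hypothetical argument.

```python
# Hidden-layer widths of the eight discriminator networks listed above.
DISCRIMINATOR_NETS = {
    "Net 1": (32, 32),
    "Net 2": (64, 64),
    "Net 3": (128, 128),
    "Net 4": (256, 256),
    "Net 5": (512, 512),
    "Net 6": (256, 256, 256),
    "Net 7": (256, 256, 256, 256),
    "Net 8": (256, 256, 256, 256, 256),
}


def parameter_count(hidden, input_dim, output_dim=1):
    """Weights plus biases of a fully connected network (assumed layer type)."""
    sizes = (input_dim, *hidden, output_dim)
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))
```

For any fixed input size, the count grows from Net 1 to Net 5 (wider) and from Net 4 to Net 8 (deeper), which gives a rough measure of the capacity being compared.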

Figure 12 shows the errors of the eight perception modules. We can see that Nets 3–8 have similar perception errors with a median error of either 2.6 or 2.7 $cm$ . Nets 1 and 2 have larger errors, among which Net 1 is the worst. These results indicate that the discriminator network architecture plays an important role in the ADT approach. A discriminator with too few units in hidden layers (Nets 1 and 2) has insufficient capacity to well distinguish the differences between simulated and real domains, therefore it cannot provide enough guidance for a target encoder to be as similar as possible to a source encoder. For our case, a network wider or deeper than Net 3 (including Net 3) is sufficient, and further widening or deepening it makes no real difference.

Fig. 12.

Box-plots for perception modules adapted using the ADT approach with different discriminator network architectures.

5.4. Control module performance

To investigate how many trajectories are sufficient for training a control module, we evaluated three control modules trained with different control datasets which have varying numbers of trajectories: 118, 333, and 2,964 (i.e., $Z_{c}^{S} (118; 10, 677)$ , $Z_{c}^{S} (333; 30, 225)$ , and $Z_{c}^{S} (2, 964; 269, 851)$ ). As introduced in Section 4.3, the trajectories in each dataset were all collected in simulation (therefore cheap) for targets uniformly distributed in the operational area. In training, we used a mini-batch size of 64 and a learning rate decreasing from 0.01 to 0.001 with respect to training steps. The metrics of control error and success rate were used in the evaluation. Their performance in 45 real-world reaching trials is shown in Figure 13. The 45 trials were for 15 targets (3 trials for each target) uniformly distributed in the operational area, with random initial left arm configurations (normally distributed around the reference configuration in Figure 5(B)).

Fig. 13.

Control performance curve that shows the median (red square), first quartile (Q1, lower bar), and third quartile (Q3, upper bar) of the Euclidean distances between the target and end-effector. Three control modules were evaluated in 45 real trials. They were trained with different numbers of simulated trajectories (the numbers in brackets). Their success rates are also listed.

From Figure 13, we can see that a control module trained with more trajectories is able to achieve a better control performance in terms of both control error and success rate. The control module trained with 118 trajectories has a success rate of 80%; the other two achieved 100%. It also has a much larger control error than the other two. This indicates that 118 trajectories are too few to obtain a good control module. The module trained with 2,964 trajectories achieved a slightly smaller control error (0.9 $cm$ , Q3: 1.2 $cm$ ) than the one trained with far fewer trajectories (333; 1.0 $cm$ , Q3: 1.7 $cm$ ). This shows that 333 trajectories are sufficient to obtain a reasonably good control module. Trading off the performance and number of trajectories, we pick the control module trained with 333 trajectories to compose the network for end-to-end reaching in Section 5.5, labeled as CC.

As a comparison, we also evaluated the pseudo-inverse method (which was used to collect trajectory samples) in the real world, using joint angles and target position (not images) as inputs. It has a median control error of 0.2 cm (Q3: 0.5 cm), which is smaller than the three trained control modules. However, the control error of CC is small enough for our experiments, because it is much smaller than the perception error of $P_{s 1}$ and PP (i.e., the control performance will not be the end-to-end performance bottleneck). Our future work will try to use reinforcement learning to further improve the control performance, but we focus on policy transfer in this article.

5.5. Hand–eye coordination

To further improve hand–eye coordination, we proposed an end-to-end fine-tuning approach using weighted losses. To evaluate the effectiveness of the approach, we compare five combined networks and a baseline.

Baseline: composed of $P_{s 1}$ and the pseudo-inverse method used to collect trajectory samples.

EE0: composed of $P_{s 1}$ and CC, directly connected after separate training without end-to-end fine-tuning.

EE1: EE0 end-to-end fine-tuned naively, only using the control loss $L_{c}$ .

EE2: EE0 fine-tuned using the proposed approach with weighted losses, without $L_{p}^{Ad}$ (i.e., $L_{p} = L_{p}^{Sup}$ ).

EE3: composed of PP and CC, directly connected after separate training without end-to-end fine-tuning.

EE4: EE3 fine-tuned using the proposed approach with weighted losses, with $L_{p}^{Ad}$ (i.e., $L_{p} = L_{p}^{Sup} + L_{p}^{Ad}$ ).

The detailed end-to-end fine-tuning settings for EE1, EE2, and EE4 are listed in Table 2. In the naive fine-tuning for EE1, extra velocity labels were collected for the real images in $Z_{Sup}^{R} (186)$ . The datasets used in the weighted fine-tuning for EE2 and EE4 are those used for their component perception and control modules, i.e., no extra dataset is required for our weighted end-to-end fine-tuning approach.

Table 2.

End-to-end fine-tuning settings.

Fine-tuning case	Datasets	Detailed settings
Naive end-to-end fine-tuning for EE1	$Z_{c}^{S} (333; 30, 225)$ $Z_{Sup}^{R} (186)$	Fine-tuned EE0 using $L_{c}$ (Equation (2)) in an end-to-end fashion with $s = I$ . A learning rate of 0.01 and a mini-batch size of 64 were used. The 186 labeled real images (from $Z_{Sup}^{R} (186)$ ) for $P_{s 1}$ were used here with velocity labels obtained using the same method for control datasets in Section 4.3. Similar to the training of $P_{s 1}$ , 87.5% samples in a mini-batch were real; the simulated samples were from $Z_{c}^{S} (333; 30, 225)$ .
Weighted end-to-end fine-tuning without $L_{p}^{Ad}$ for EE2	$Z_{c}^{S} (333; 30, 225)$ $Z_{Sup}^{R} (186)$	End-to-end fine-tuned EE0 using the weighted loss L (Equation (3)) with $L_{p} = L_{p}^{Sup}$ , $β = 0.9$ . A learning rate of 0.01 was used with a mini-batch size of 8 and 64 for $L_{c}$ and $L_{p}$ , respectively. In each fine-tuning step, 8 random image–velocity pairs from $Z_{c}^{S} (333; 30, 225)$ were used to obtain $δ_{L_{c}}$ ; its image–position pairs were used with labeled real images (from $Z_{Sup}^{R} (186)$ ) to obtain $δ_{L_{p}^{Sup}}$ . Similar to the training of $P_{s 1}$ , 87.5% of the samples in a mini-batch for $L_{p}^{Sup}$ were real, i.e., 56 real and 8 simulated samples.
Weighted end-to-end fine-tuning with $L_{p}^{Ad}$ for EE4	$Z_{c}^{S} (333; 30, 225)$ $Z_{Sup}^{S} (3, 000)$ $Z_{Sup}^{R} (93)$ $Z_{Ad}^{R} (186)$	End-to-end fine-tuned EE3 using the weighted loss L (Equation (3)) with $L_{p} = L_{p}^{Sup} + L_{p}^{Ad}$ (Equation (1)), $β = 0.9$ . A learning rate of 0.001 was used with a mini-batch size of 16 and 32 for $L_{c}$ and $L_{p}^{Sup}$ , respectively. In each fine-tuning step, 16 image–velocity pairs from $Z_{c}^{S} (333; 30, 225)$ were used to obtain $δ_{L_{c}}$ ; its image–position pairs were used with labeled real images (from $Z_{Sup}^{R} (93)$ ) to obtain $δ_{L_{p}^{Sup}}$ . In a mini-batch for $L_{p}^{Sup}$ , 50% of the samples were real, i.e., 16 labeled real and 16 labeled simulated images. The adversarial loss $L_{p}^{Ad}$ (Equation (5)) was calculated in a more complex way. In particular, 32 simulated images (the same 16 samples from $Z_{c}^{S} (333; 30, 225)$ and 16 more from $Z_{Sup}^{S} (3, 000)$ ) and 16 unlabeled real images (from $Z_{Ad}^{R} (186)$ ) were used to calculate $L_{D}^{Ad}$ ; and the same 16 unlabeled real images were used to calculate $L_{E}^{Ad}$ .

In the fine-tuning for EE2 and EE4, the hyper-parameters such as $β$ and the percentage of real samples in a mini-batch for $L_{p}^{Sup}$ were determined empirically through 5–7 tuning experiments for each parameter. Too large or small $β$ or percentage of real samples could cause less improvement in hand–eye coordination or even worse performance. The usage of real images for $L_{p}^{Sup}$ is crucial to avoid catastrophic forgetting of adapted perception modules. From Table 2, we can see that EE4 needs fewer real samples in a mini-batch than EE2. This is because $L_{p}^{Ad}$ in the fine-tuning for EE4 provides extra help to avoid catastrophic forgetting.

The baseline and combined networks were first evaluated in the real world without distractor objects on the table (Figure 14(A)), then in the case with novel distractor objects in clutter (Figure 14(B)). In the case of Figure 14(B), 6 novel distractor objects (not seen in training) and 3 more white board eraser boxes (only the single box case was seen in training) were used in addition to the 11 distractor objects that appeared in training. The metrics of control error (e) and success rate ( $σ$ ) were used. Here $e_{med}$ and $e_{Q 3}$ are the median and third quartile of control errors. Their results in 45 real-world reaching trials are listed in Table 3. The 45 trials were for the same targets and initial left arm configurations used in Section 5.4. The results for CC and the pseudo-inverse method (from Section 5.4) are also listed in the table (first two rows).

Fig. 14.

Real-world test cases for measuring end-to-end performance: (A) reaching the blue cuboid without distractor objects; (B) reaching with seen and novel (not seen in training) objects as distractors; (C) reaching with occlusion(s); (D) reaching when the target is moving.

Table 3.

Real-world end-to-end control performance.

Test condition	Network/method	$e_{med}$ (cm)	$e_{Q 3}$ (cm)	$σ$ (%)
Control with ground-truth $Θ$	The pseudo-inverse method	0.2	0.5	100
	CC	1.0	1.7	100
Single object (Figure 14(A))	Baseline: $P_{s 1}$ + the pseudo-inverse method	2.5	3.7	86.7
	EE0: $P_{s 1}$ + CC	2.7	3.9	86.7
	EE1: naively fine-tuned EE0	6.0	7.8	42.2
	EE2: EE0 fine-tuned using our approach	1.9	2.7	95.6
	EE3: PP + CC	2.1	2.6	95.6
	EE4: EE3 fine-tuned using our approach	1.6	2.9	97.8
Clutter with novel objects (Figure 14(B))	Baseline	2.6	4.3	80.0
	EE0	3.5	4.8	68.9
	EE1	11.3	17.7	13.3
	EE2	2.1	2.7	95.6
	EE3	2.9	3.4	93.3
	EE4	1.8	2.6	97.8
With occlusions (Figure 14(C))	EE4	4.6	8.5	48.9

From Table 3, we can observe similar trends in the results for the cases of Figure 14(A) and (B). In comparison, the baseline and combined networks have larger errors in the case with distractor objects (more realistic). In particular, Baseline achieved similar performance in the two test cases: similar median control errors (2.5 and 2.6 cm) and success rates (86.7% and 80.0%). Its errors are quite close to the perception error of $P_{s 1}$ (2.8 cm, Q3: 3.9 cm), but much larger than that of the pseudo-inverse method. This shows that the performance bottleneck of Baseline mainly comes from the perception module, and that $P_{s 1}$ generalized to both test cases.

In contrast, EE0 achieved a similar performance in the case of Figure 14(A) (2.7 cm $e_{med}$ and 86.7% $σ$ ), but has an obvious performance drop in the case of Figure 14(B): the median error increased to 3.5 cm and the success rate decreased to 68.9%. The increased error is much larger than those of $P_{s 1}$ and CC. This shows that, for a directly connected network (EE0), the performance bottleneck comes from not only its component perception module ( $P_{s 1}$ ) but also the coordination between perception and control (CC).

A similar performance drop can also be observed from the results of EE3 in the two test cases: $e_{med}$ increased from 2.1 to 2.9 cm. However, the decrease in its success rate is trivial (from 95.6% to 93.3%). In addition, EE3 has much smaller errors and much higher success rates than EE0 in both cases, although the two combined networks have the same control module CC and perception modules with similar errors ( $P_{s 1}$ : 2.8 cm, Q3: 3.9 cm; PP: 2.7 cm, Q3: 3.9 cm). These results show that a directly connected network (EE3) consisting of a perception module trained using ADT (PP) has better hand–eye coordination, i.e., PP has an output distribution that better fits the control module CC than $P_{s 1}$ .

After weighted end-to-end fine-tuning, both EE2 and EE4 achieved better performances than EE0 and EE3. The improvement is significant, particularly in the case with novel distractor objects: EE2 has 40.0% smaller median control error and 38.8% higher success rate; EE4 has 37.9% smaller median control error. The improvement of EE4 in success rate is trivial, as EE3 already has a very high success rate (93.3%). In contrast, EE1 has a much worse performance than EE0 in the two test cases (its performance in the case of Figure 14(B) is even worse). This shows that our weighted end-to-end fine-tuning approach is able to significantly improve the performance of a combined network, but a naive approach could make the performance even worse. The end-to-end fine-tuning method works for both supervised perception adaptation (EE2) and ADT (EE4).
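The percentages quoted above follow from the clutter rows of Table 3 as relative changes. The short check below uses a hypothetical helper name; note that the 38.8% success-rate gain for EE2 is relative to the EE0 rate, which corresponds to 26.7 percentage points.

```python
def relative_change(before, after):
    """Signed relative change in percent, measured against the 'before' value."""
    return 100.0 * (after - before) / before


# Clutter case (Figure 14(B)), values taken from Table 3.
ee2_error = relative_change(3.5, 2.1)      # EE0 -> EE2 median control error
ee2_success = relative_change(68.9, 95.6)  # EE0 -> EE2 success rate
ee4_error = relative_change(2.9, 1.8)      # EE3 -> EE4 median control error

print(round(ee2_error, 1), round(ee2_success, 1), round(ee4_error, 1))
```

Negative values indicate a reduction, so the two error entries correspond to the 40.0% and 37.9% smaller median control errors reported in the text.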

In addition, EE2 and EE4 even have much smaller control errors than the errors of their component perception modules ( $P_{s 1}$ and PP) in the challenging test case with novel distractor objects. If we individually evaluate the perception module in EE2 using the same test set in Section 5.1, its perception error increased from 2.8 cm (Q3: 3.9 cm) to 3.0 cm (Q3: 5.2 cm). Similarly, the perception error of EE4 increased from 2.7 cm (Q3: 3.9 cm) to 4.2 cm (Q3: 6.1 cm). These results indicate that the weighted end-to-end fine-tuning did improve the coordination between the perception and control modules (hand–eye coordination) in EE2 and EE4, rather than improving them individually.

To further evaluate the performance of EE4 in more challenging cases, we conducted more experiments with the target cuboid partially occluded, as shown in Figure 14(C). From the results in the last row of Table 3, we can observe an obvious performance drop compared with that for the cases of Figure 14(A) and (B), but EE4 was still able to reach half of the targets. We also tested EE4 in the case when the target cuboid was moving (Figure 14(D)). It was able to adapt to target position changes in real time and performed well in most cases as shown in the attached video.¹

6. Discussion

The results described previously lead us to the following observations.

6.1. Effectiveness of ADT

The significant reduction (50%) in the required number of labeled real images for sim-to-real transfer of visuo-motor policies shows the effectiveness of the ADT. The PI controller and discriminator network architecture both play important roles in the approach. An acceptable transfer accuracy (3.0 cm) can be achieved with as few as 48 labeled real images, which is promising for robotic applications where labeling data is expensive or impractical.

However, the approach in its current version can only make effective use of unlabeled real images when their number is no more than twice the number of labeled ones. This precludes using very few labeled and many unlabeled real images to further reduce the cost. More investigation is necessary to tackle this problem, enabling few-shot transfer of visuo-motor policies from simulation to the real world.

6.2. Value of a modular structure and end-to-end fine-tuning

The significant performance improvement of EE2 and EE4 after end-to-end fine-tuning with weighted losses shows the effectiveness and scalability of the modular approach for more complicated tasks than the planar reaching (Zhang et al., 2017a). Benefiting from the modular structure as well as the ADT approach, visuo-motor policies for a table-top reaching task can be learned and transferred from simulation to the real world with just 33,225 simulated (including the 30,225 ones for end-to-end fine-tuning) as well as 93 labeled and 186 unlabeled real samples, achieving a comparable performance to pure domain randomization approaches (James et al., 2017) (the reaching stage of the multi-stage task) but with fewer training data in total.

The modular approach can also be used in more general ways. Although we explicitly equated the bottleneck layer with the target object position in this work, the bottleneck in general could be any explicit or latent low-dimensional features (as in an auto-encoder). The perception and control modules can also be trained with other methods such as unsupervised learning and reinforcement learning. The effectiveness of the modular approach for reinforcement learning (DQN) has been validated in a planar reaching task (Zhang et al., 2017a,b).

6.3. Domain randomization and adaptation

In Section 5.1, the perception module trained with 3,000 simulated images ($P_{s0}$) has a large error (47.1 cm), much larger than would be expected from the results of Tobin et al. (2017). Apart from the experiments in Section 5.1, we also trained a number of perception modules using simulated images with random RGB values for the table, floor, and robot body rather than $\pm 10\%$ changes around the reference colors. However, this did not bring a significant accuracy improvement. Possible reasons include overly simple textures (only random RGB values), overly simple randomization of lighting conditions, the absence of simulated shadows, and sensitivity to the domain randomization parameters and their tuning.
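The two color randomization schemes compared above can be sketched as follows. This is only an illustration under stated assumptions: the reference colors are hypothetical placeholders (the actual values are not given here), colors are RGB triples in [0, 1], and the $\pm 10\%$ change is assumed to be an independent multiplicative perturbation per channel.

```python
import random

# Hypothetical reference colors (RGB in [0, 1]); placeholders, not the real values.
REFERENCE_COLORS = {
    "table": (0.55, 0.40, 0.25),
    "floor": (0.30, 0.30, 0.30),
    "robot_body": (0.80, 0.10, 0.10),
}


def perturb_color(reference, rng, rel_range=0.10):
    """Scale each channel by a factor in [1 - rel_range, 1 + rel_range], clipped to [0, 1]."""
    return tuple(
        min(1.0, max(0.0, c * rng.uniform(1.0 - rel_range, 1.0 + rel_range)))
        for c in reference
    )


def random_color(rng):
    """Draw a color uniformly at random, ignoring the reference."""
    return tuple(rng.random() for _ in range(3))


def sample_scene_colors(full_random=False, seed=None):
    """Sample colors for the table, floor, and robot body of one simulated scene."""
    rng = random.Random(seed)
    if full_random:
        return {name: random_color(rng) for name in REFERENCE_COLORS}
    return {name: perturb_color(ref, rng) for name, ref in REFERENCE_COLORS.items()}


if __name__ == "__main__":
    print(sample_scene_colors(seed=0))                    # +/-10% around references
    print(sample_scene_colors(full_random=True, seed=0))  # fully random RGB
```

Both schemes vary only flat colors, which is consistent with the first possible reason listed above: neither introduces texture, lighting, or shadow variation.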

Nevertheless, with the ADT approach, the adaptation with just a few labeled real images (as few as 48) is able to transfer a network from simulation to the real world, and needs fewer simulated images than pure domain randomization approaches (James et al., 2017; Tobin et al., 2017). The combination of domain randomization and adaptation is promising for more efficient deep neural network transfer.

7. Conclusion

In this article, we have proposed an ADT approach for cheaper transfer of visuo-motor policies from simulation to the real world. Its feasibility was demonstrated with a modular approach in the task of reaching a table-top object amongst clutter with a 7-DoF robotic arm in velocity mode. Our adversarial transfer approach reduced the labeled real data requirement by 50%. Successful transfer was achieved with only 93 labeled and 186 unlabeled real images. By using weighted losses to fine-tune a combined network in an end-to-end fashion, its reaching accuracy was improved significantly (37.9% better than that before fine-tuning), achieving a success rate of 97.8% with a median control error of 1.8 cm. The learned policies are robust to novel distractor objects in clutter and even a moving target. The ADT along with the modular approach is promising for more efficient sim-to-real transfer of visuo-motor policies.

The datasets and code are available at https://github.com/Fanleyrobot/ADT.

Footnotes

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was conducted by the Australian Research Council Centre of Excellence for Robotic Vision (project number CE140100016). This research has also been supported by the Australian Research Council’s Linkage Infrastructure, Equipment and Facilities scheme (project number LE160100090). Additional computational resources and services were provided by the HPC and Research Support Group at QUT.

ORCID iD

Fangyi Zhang

References

Bateux, Marchand, Leitner, Chaumette, Corke (2018) Training deep neural networks for visual servoing. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3307–3314.

Bousmalis, Irpan, Wohlhart, et al. (2018) Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 4243–4250.

Bousmalis, Silberman, Dohan, Erhan, Krishnan (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3722–3731.

D’Innocente, Carlucci, Colosi, Caputo (2017) Bridging between computer and robot vision through data augmentation: A case study on object recognition. In: Proceedings of the International Conference on Computer Vision Systems (ICVS), pp. 384–393.

Fitzgerald, Goel, Thomaz (2015) A similarity-based approach to skill transfer. In: Women in Robotics Workshop at Robotics: Science and Systems Conference (RSS).

Ganin, Ustinova, Ajakan, et al. (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(1): 2096–2030.

Ge, Demyanov, Chen, Garnavi (2017) Generative openmax for multi-class open set classification. In: Proceedings of the British Machine Vision Conference (BMVC).

Gebru, Hoffman, Fei-Fei (2017) Fine-grained recognition in the wild: A multi-task domain adaptation approach. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1358–1367.

Goodfellow, Pouget-Abadie, Mirza, et al. (2014) Generative adversarial nets. In: Advances in Neural Information Processing Systems (NIPS), pp. 2672–2680.

Hinton, Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. Science 313(5786): 504–507.

James, Davison, Johns (2017) Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task. In: Levine, Vanhoucke, Goldberg (eds.) Proceedings of the 1st Annual Conference on Robot Learning (CoRL) (Proceedings of Machine Learning Research, Vol. 78), pp. 334–343.

Katyal, Wang, Burli (2017) Leveraging deep reinforcement learning for reaching robotic tasks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.

Krizhevsky, Sutskever, Hinton (2012) Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105.

LeCun (1988) A theoretical framework for back-propagation. In: Touretzky, Hinton, Sejnowski (eds.) Proceedings of the 1988 Connectionist Models Summer School, CMU, Pittsburgh, PA. San Mateo, CA: Morgan Kaufmann, pp. 21–28.

Lenz, Knepper, Saxena (2015) DeepMPC: Learning deep latent features for model predictive control. In: Robotics: Science and Systems (RSS).

Levine, Finn, Darrell, Abbeel (2016a) End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17(39): 1–40.

Levine, Pastor, Krizhevsky, Quillen (2016b) Learning hand-eye coordination for robotic grasping with large-scale data collection. In: International Symposium on Experimental Robotics (ISER). Berlin: Springer, pp. 173–184.

Luo, Zou, Hoffman, Fei-Fei (2017) Label efficient learning of transferable representations acrosss domains and tasks. In: Advances in Neural Information Processing Systems (NIPS), pp. 164–176.

Pan, Yang (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10): 1345–1359.

Pinto, Gupta (2016) Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3406–3413.

Rohmer, Singh, Freese (2013) V-REP: A versatile and scalable robot simulation framework. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1321–1326.

Rusu, Rabinowitz, Desjardins, et al. (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671.

Rusu, Vecerik, Rothörl, Heess, Pascanu, Hadsell (2017) Sim-to-real robot learning from pixels with progressive nets. In: Levine, Vanhoucke, Goldberg (eds.) Proceedings of the 1st Annual Conference on Robot Learning (CoRL) (Proceedings of Machine Learning Research, Vol. 78), pp. 262–270.

Simonyan, Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representations (ICLR).

Sünderhauf, Brock, Scheirer, et al. (2018) The limits and potentials of deep learning for robotics. The International Journal of Robotics Research 37(4–5): 405–420.

Taylor, Stone (2009) Transfer learning for reinforcement learning domains: A survey. The Journal of Machine Learning Research 10: 1633–1685.

Tieleman, Hinton (2012) Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4: 2.

Tobin, Fong, Ray, Schneider, Zaremba, Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30.

Tow, Shirazi, Leitner, Sünderhauf, Milford, Upcroft (2016) A robustness analysis of deep Q networks. In: Proceedings of the Australasian Conference on Robotics and Automation (ACRA).

Tzeng, Devin, Hoffman, et al. (2016) Adapting deep visuomotor representations with weak pairwise constraints. In: Workshop on the Algorithmic Foundations of Robotics (WAFR).

Tzeng, Hoffman, Darrell, Saenko (2015) Simultaneous deep transfer across domains and tasks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4068–4076.

Tzeng, Hoffman, Saenko, Darrell (2017) Adversarial discriminative domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7167–7176.

Viereck, Pas, Saenko, Platt (2017) Learning a visuomotor controller for real world robotic grasping using simulated depth images. In: Levine, Vanhoucke, Goldberg (eds.) Proceedings of the 1st Annual Conference on Robot Learning (CoRL) (Proceedings of Machine Learning Research, Vol. 78), pp. 291–300.

Wulfmeier, Bewley, Posner (2017) Addressing appearance change in outdoor robotics with adversarial domain adaptation. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1551–1558.

Wulfmeier, Bewley, Posner (2018) Incremental adversarial domain adaptation for continually changing environments. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 4489–4495.

Zhang, Leitner, Milford, Corke (2017a) Modular deep Q networks for sim-to-real transfer of visuo-motor policies. In: Proceedings of the Australasian Conference on Robotics and Automation (ACRA).

Zhang, Leitner, Milford, Corke (2017b) Tuning modular networks with weighted losses for hand–eye coordination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 496–497.

Zhang, Leitner, Milford, Upcroft, Corke (2015) Towards vision-based deep reinforcement learning for robotic motion control. In: Proceedings of the Australasian Conference on Robotics and Automation (ACRA).
