Abstract
The realization of universal robots is an ultimate goal of researchers. However, a key hurdle in achieving this goal lies in the robots’ ability to manipulate objects in unstructured environments according to different tasks. Learning-based approaches are considered an effective way to address generalization. The impressive performance of foundation models in computer vision and natural language processing suggests that embedding foundation models into manipulation tasks is a viable path toward general manipulation capability. However, we believe that achieving general manipulation capability requires an overarching framework akin to those used in autonomous driving. This framework should encompass multiple functional modules, with different foundation models assuming distinct roles. This survey focuses on the contributions of foundation models to robot learning for manipulation. We propose a comprehensive framework and detail how foundation models can address challenges in each module of the framework. Furthermore, we examine current approaches, outline challenges, suggest future research directions, and identify potential risks associated with integrating foundation models into this domain.
Introduction
Researchers aim to create universal robots that can seamlessly integrate into human life to boost productivity, much like those depicted in the movie “I, Robot.” However, a key hurdle in achieving this lies in the robots’ ability to manipulate objects in unstructured environments according to different tasks. There is abundant literature on improving the general manipulation capability of robots, which can be roughly categorized into model-based and learning-based approaches (Zarrin et al., 2023). Because the real world is highly diverse, universal robots must adapt to unstructured environments and arbitrary objects to manipulate effectively. Therefore, learning-based methods are crucial for manipulation tasks (Kleeberger et al., 2020).
The predominant methodologies in learning-based approaches are deep learning, reinforcement learning, and imitation learning. Learning-based methods have spanned from acquiring specific manipulation skills through labeled datasets like human demonstration, to acquiring abstract representations of manipulation tasks conducive to high-level planning, to exploring an object’s functionalities through interaction and encompassing various objectives in between (Kroemer et al., 2021). However, challenges persist, including (1) unnatural interaction with humans; (2) high-cost data collection; (3) limited perceptual capability; (4) non-intelligent hierarchy of skills; (5) inaccurate pre- and post-conditions and post-hoc correction; (6) unreliable skill learning; (7) poor environment transition (Hu et al., 2023b).
Foundation models are primarily pretrained on vast internet-scale datasets, enabling them to be fine-tuned for diverse tasks. Their significant advancements in vision and language processing contribute to mitigating the aforementioned challenges. Based on Firoozi et al. (2023) and considering the different input modalities and functionalities of the models, we categorize foundation models into six types: LLMs, VLMs, LMMs, VGMs, VFMs, and RFMs.
In this survey, we investigate how foundation models are utilized in robot learning for manipulation, as summarized in Figure 1: (1) LLMs help address challenges in interaction, manipulation data generation, hierarchy of skills, skill policy learning, and environment transition model. (2) VLMs assist in tackling challenges in interaction, manipulation data generation, hierarchy of skills, pre- and post-conditions detection, skill policy learning, and perception. (3) LMMs aid in addressing challenges in interaction and perception. (4) VGMs tackle the challenges of manipulation data generation and environment transition. (5) VFMs help address challenges in manipulation data generation, hierarchy of skills, pre- and post-conditions detection, skill policy learning, and perception. (6) RFMs assist in addressing the challenge of skill policy learning.

These findings underscore the potential of embedding foundation models into manipulation tasks as a viable path toward achieving general manipulation capability. However, we do not believe that a single foundation model alone can achieve general manipulation capability. Although RFMs currently represent a single-model, end-to-end training approach, ensuring safety and stability, and in particular achieving a success rate above 99% in manipulation tasks, remains a challenge. A success rate above 99% is crucial because human manipulation success rates are around 99%. Without this level of accuracy, robots cannot replace humans (Kumar, 2023). Therefore, drawing inspiration from the development of autonomous driving systems (Hu et al., 2023c), we argue that achieving general manipulation capability necessitates an overarching framework that encompasses multiple functional modules, with different foundation models assuming distinct roles.
The ultimate general manipulation framework should be able to interact with humans or other agents and control the robot’s whole body to manipulate arbitrary objects in open-world scenarios and achieve diverse manipulation tasks (McCarthy et al., 2024). Drawing from Kroemer et al. (2021) and this definition of general manipulation, we propose a comprehensive framework for general manipulation. However, the interaction between robot and human involves not only recognizing intentions but also learning new skills or improving old skills from human experts in the external world. Open-world scenarios may be static or dynamic. Objects can be either rigid or deformable. Task objectives can vary from short-term to long-term. Furthermore, tasks may necessitate different degrees of precision with respect to contact points and applied forces/torques. Despite these challenges, general manipulation can be achieved through multiple stages. We define Level 0 (L0) as the stage in which the robot’s learning capability is restricted to improving old skills and manipulation is restricted to rigid objects in static scenes, with short-horizon task objectives and low precision requirements for contact points and forces/torques. We also believe that improving the algorithm performance of the different modules in the framework can support the transition from L0 to full general manipulation. Hence, we aim to use this survey not only to inform scholars about the issues that foundation models can address in robot learning for manipulation but also to stimulate their exploration of the general manipulation framework and of the roles that various foundation models can play in it.
Di Palo et al. (2023) and Firoozi et al. (2023) provide detailed descriptions of the application of foundation models in navigation and manipulation, but they do not consider in depth the relationships between foundation models across different applications. The survey most closely related to this paper is Xiao et al. (2023). Compared with that survey, ours focuses on the contributions of foundation models to robot learning for manipulation, proposing a comprehensive framework and detailing how foundation models can address challenges in each module of the framework.
This paper is structured as follows. In Section Framework of robot learning for general manipulation, we present a comprehensive framework of robot learning for general manipulation, based on the developmental history of robot learning for manipulation and the definition of general manipulation. We then elaborate on the impact of foundation models on each module of the framework in the following sections: Section Human/agent interaction; Section Pre- and post-conditions detection; Section Hierarchy of skills; Section State; Section Policy; Section Manipulation data generation. In Section Discussion, we discuss several issues of particular concern to us. In Section Conclusion, we summarize the contributions of this survey and identify the limitations of the current framework as well as the challenges in each module.
Framework of robot learning for general manipulation
Over the past decade, there has been a significant expansion in research concerning robot manipulation, with a focus on leveraging the growing accessibility of cost-effective robot arms and grippers to enable robots to interact directly with the environment in pursuit of their objectives. As the real world encompasses extensive variation, a robot cannot expect to possess an accurate model of its unstructured environment, the objects within it, or the skills necessary for manipulation in advance (Kroemer et al., 2021).
In the early stage, robot manipulation is defined as learning a policy Π through deep learning, reinforcement learning, imitation learning, or similar methods. This policy maps observations of the environment and the robot’s state S to actions α, which control the robot’s joint movements to execute tasks. For example, RLAfford (Geng et al., 2023b) and GraspNet (Fang et al., 2020b) take a point cloud as input and output the target pose. This process is represented by the Skill Execution module, as shown in Figure 2.
Figure 2. Framework of robot learning for general manipulation. The pre-conditions detection module P perceives the environment to identify objects and the affordances those objects support. The interaction module I receives instruction from a human or other agent. It uses perception information from the pre-conditions detection module P to check for ambiguities in the instruction. If there are any ambiguities, it generates a question to clarify the instruction by asking the human or other agent. The hierarchy of skills module H generates subgoals by using the precise instruction from the interaction module I and perception information from the pre-conditions detection module P. Each subgoal is then passed to the skill execution module. In the skill execution module, the policy module Π generates action α based on the state S. To obtain the next state after executing the current action, the state module S can either perceive it from the environment or use the transition module T. To train the skill execution module, including the state module S, the policy module Π, and the transition module T, the manipulation data generation module is required. This module provides a task-level manipulation dataset. When issues arise during execution, corrective instruction is sent to the policy module Π for manual adjustment. The policy module Π modifies the current action into a corrective action and saves the corrective demonstration to the dataset for its own self-improvement. After skill execution, the post-conditions detection module P determines whether execution succeeded. If it succeeded, the framework proceeds to the next subgoal; if not, the failure reason is conveyed to the post-hoc correction module for self-correction.
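The flow described in the Figure 2 caption can be sketched as a small control loop. This is only an illustrative sketch: the class name, the callable signatures, and the retry limit are assumptions introduced here, and each module is a stub where a real system would call a foundation model or a learned policy.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ManipulationFramework:
    """Hypothetical wiring of the modules named in the Figure 2 caption."""
    perceive: Callable[[], dict]                           # pre-conditions detection P
    find_ambiguity: Callable[[str, dict], Optional[str]]   # interaction I: returns a question or None
    ask: Callable[[str], str]                              # human/agent answers with a clarified instruction
    decompose: Callable[[str, dict], List[str]]            # hierarchy of skills H
    execute: Callable[[str], None]                         # skill execution (state S, policy, transition T)
    check: Callable[[str], Tuple[bool, str]]               # post-conditions detection P: (success, reason)
    correct: Callable[[str, str], None]                    # post-hoc correction
    max_retries: int = 2                                   # assumed retry budget per subgoal
    log: List[str] = field(default_factory=list)

    def run(self, instruction: str) -> bool:
        scene = self.perceive()
        # Keep asking until the instruction is unambiguous for this scene.
        question = self.find_ambiguity(instruction, scene)
        while question is not None:
            instruction = self.ask(question)
            question = self.find_ambiguity(instruction, scene)
        for subgoal in self.decompose(instruction, scene):
            for _ in range(self.max_retries + 1):
                self.execute(subgoal)
                ok, reason = self.check(subgoal)
                self.log.append(f"{subgoal}: {'ok' if ok else reason}")
                if ok:
                    break
                self.correct(subgoal, reason)
            else:
                return False  # retries exhausted for this subgoal
        return True


# Stub modules for a toy "grasp the cup" scene with two cups.
def find_ambiguity(text: str, scene: dict) -> Optional[str]:
    matches = [cup for cup in scene["cups"] if cup in text]
    return None if len(matches) == 1 else "Which cup: red cup or green cup?"


attempts = {"count": 0}


def check(subgoal: str) -> Tuple[bool, str]:
    attempts["count"] += 1
    if attempts["count"] == 1:
        return False, "gripper slipped"  # first attempt fails on purpose
    return True, ""


framework = ManipulationFramework(
    perceive=lambda: {"cups": ["red cup", "green cup"]},
    find_ambiguity=find_ambiguity,
    ask=lambda question: "grasp the red cup",
    decompose=lambda text, scene: ["reach red cup", "close gripper"],
    execute=lambda subgoal: None,
    check=check,
    correct=lambda subgoal, reason: None,
)
succeeded = framework.run("grasp the cup")
```

In this toy run the ambiguous instruction triggers one clarifying question, the first subgoal fails once and is retried after correction, and the log records each post-condition check.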
In the mid-term stage, attention turned to long-horizon tasks, since many tasks in robotics require a series of correct actions. For example, making a cup of tea with a robot involves multiple sequential steps such as boiling water, adding a tea bag, and pouring hot water. Learning to plan for long-horizon tasks is a central challenge in episodic learning problems (Wang et al., 2020b). Decomposing tasks has several advantages. It makes learning individual skills more efficient by breaking tasks into shorter-horizon segments, thus aiding exploration. Reusing skills in multiple settings can speed up learning by avoiding the need to relearn elements from scratch each time. Researchers train a hierarchy model to decompose the task into a sequence of subgoals (Ahn et al., 2022), and observe pre- and post-conditions to ensure that the prerequisites and outcomes of each subgoal are met (Cui et al., 2022). These three processes are represented as the Hierarchy of Skills module H, the Pre-conditions Detection module P, and the Post-conditions Detection module P in Figure 2. However, using post-conditions detection only to detect task success is insufficient. It should also identify the reasons for task failure to help the robot self-correct and improve success rates. Therefore, we add a Post-hoc Correction module, as shown in Figure 2.
Recently, researchers have realized that training policies requires real-world interaction between the robot and its environment, which inevitably increases the probability of unforeseen hazardous situations. Therefore, researchers aim to learn the environment’s transition model T. Once the model is fitted, the robot can generate samples from it, significantly reducing the frequency of direct interaction between the robot and the environment (Liu et al., 2024e). This process is represented as the Transition module T in Figure 2.
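As a minimal illustration of this idea, the sketch below fits a tabular transition model from logged (state, action, next state) tuples and then samples imagined rollouts without touching the robot. It is a deliberately simple stand-in: the symbolic state names, the class name, and the count-based model are assumptions made here, whereas the works discussed in this survey typically learn such models with neural networks.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, Iterable, List, Tuple

Transition = Tuple[Hashable, Hashable, Hashable]  # (state, action, next_state)


class TabularTransitionModel:
    """Count-based estimate of P(next_state | state, action)."""

    def __init__(self) -> None:
        self.counts = defaultdict(Counter)

    def fit(self, transitions: Iterable[Transition]) -> None:
        for state, action, next_state in transitions:
            self.counts[(state, action)][next_state] += 1

    def sample(self, state, action, rng: random.Random):
        outcomes = self.counts.get((state, action))
        if not outcomes:
            raise KeyError(f"no data for state={state!r}, action={action!r}")
        states = list(outcomes)
        weights = [outcomes[s] for s in states]
        return rng.choices(states, weights=weights, k=1)[0]

    def rollout(self, state, actions: List, rng: random.Random) -> List:
        """Imagine a trajectory by sampling from the fitted model only."""
        trajectory = [state]
        for action in actions:
            state = self.sample(state, action, rng)
            trajectory.append(state)
        return trajectory


# Hypothetical logged data: grasping succeeds 9 times out of 10.
logged = (
    [("cup_on_table", "grasp", "cup_in_hand")] * 9
    + [("cup_on_table", "grasp", "cup_on_table")]
    + [("cup_in_hand", "lift", "cup_lifted")] * 5
    + [("cup_on_table", "lift", "cup_on_table")]
)
model = TabularTransitionModel()
model.fit(logged)

rng = random.Random(0)
imagined = model.rollout("cup_on_table", ["grasp", "lift"], rng)
```

Rollouts such as `imagined` can then be used as additional training samples, which is the sense in which a fitted model reduces direct interaction with the environment.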
The modules described above are summarized from the development of robot learning for manipulation. However, they are still insufficient for a comprehensive framework for general manipulation. The ultimate general manipulation framework should be able to interact with humans or other agents and control the robot’s whole body to manipulate arbitrary objects in open-world scenarios, achieving diverse manipulation tasks. When the robot interacts with a human or other agent to understand task objectives, the transmitted instruction may sometimes be unclear. For example, when there are two cups in the environment, the robot needs to determine into which cup to pour water. Therefore, we add the Interaction module I in Figure 2 to understand the precise task objective.
The aforementioned modules all require datasets for learning. The data collection process for the Hierarchy of Skills module H and the Pre- and Post-conditions Detection modules P is similar to that in the fields of CV and NLP. Unlike data collection in CV and NLP, however, gathering datasets for manipulation tasks additionally requires robot trajectories to train the policy. Therefore, we include the Manipulation Data Generation module in Figure 2.
We organize the framework of robot learning for general manipulation according to its development history and definition, as shown in Figure 2. The caption of Figure 2 outlines the flow of the entire framework, including the inputs, outputs, and specific function of each module.
Current research on foundation models for manipulation primarily focuses on several key modules: the Interaction module, the Pre- and Post-conditions Detection module, the Hierarchy of Skills module, the State module, the Policy module, and the Manipulation Data Generation module. The following sections provide an overview of these modules.
Human/agent interaction
There are two ways for humans or other agents to interact with a robot: (1) providing task instruction to the robot to help it understand the task objective and complete the task independently (Khan et al., 2023); (2) collaborating with the robot to complete tasks, sharing workspace information, and conveying corrective instruction when useful or error-correcting information is identified to optimize the robot’s current action (Lynch et al., 2023).
When task instruction is conveyed to the robot, the task goal may be linguistically ambiguous. For example, the scene may contain both a red cup and a green cup while the task instruction is “grasp the cup,” leaving the robot unsure which cup to grasp. To address this issue, the robot needs to ask the human or other agent to confirm the final task objective, which requires stronger text generation and comprehension capability. When corrective instruction is conveyed, the robot needs to comprehend its meaning and translate it into appropriate actions. For instance, if a robot is picking up a book from a shelf filled with books, lifting too quickly may cause other books to fall. The human or collaborating agent needs to alert the robot that the current lifting action is dangerous and advise it to lift slowly. If necessary, the robot should also report its current execution state, such as its grasping speed, and ask whether this speed is considered high. Because corrective instructions are diverse, understanding them is essential.
To address instruction ambiguity and the challenges of text generation and comprehension, SeeAsk (Mo et al., 2023) utilizes CLIP’s perceptual module to identify objects in the scene and employs a fixed questioning template to ask which object should be manipulated. Although CLIP enhances generalization in object recognition, it cannot generate questions or comprehend answers from the outside world, and because of its fixed questioning template SeeAsk addresses only ambiguities concerning object color and spatial relationships. KNOWNO (Ren et al., 2023a) utilizes an LLM to score candidate next actions. If the score difference between the top two actions is less than a threshold, the instruction is considered ambiguous, and the robot asks for confirmation of the final action. This approach improves efficiency and autonomy. Matcha (Zhao et al., 2023c) not only employs vision but also utilizes haptic and sound sensing to perceive object properties, such as material. When it encounters ambiguity in object attribute recognition, it leverages an LLM to generate the inquiry. CoELA (Zhang et al., 2023b) utilizes an LLM as both a communication module and a planning module to enhance interaction text generation and comprehension, as well as task scheduling, with collaborating agents. LLM-GROP (Ding et al., 2023) utilizes an LLM to extract latent commonsense knowledge embedded within task instruction. For example, a task instruction might be “set dinner table with plate and fork,” while the latent commonsense knowledge could be “fork is on the left of a bread plate.”
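The score-gap test described above for KNOWNO can be illustrated with a few lines of code. The sketch follows the description in the text rather than the published implementation; the function name, the example scores, and the threshold value are assumptions, and in practice the scores would come from an LLM rather than a hand-written dictionary.

```python
from typing import Dict, Optional


def detect_ambiguity(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Return a clarifying question if the two best-scored actions are too close.

    `scores` maps each candidate next action to a (hypothetical) LLM score.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return None  # a single candidate cannot be ambiguous
    (best, best_score), (second, second_score) = ranked[0], ranked[1]
    if best_score - second_score < threshold:
        return f"Did you mean '{best}' or '{second}'?"
    return None


# "grasp the cup" with a red and a green cup in the scene: scores are close.
question = detect_ambiguity(
    {"pick red cup": 0.48, "pick green cup": 0.45, "pick plate": 0.07},
    threshold=0.1,
)

# A clear winner: no question is needed.
no_question = detect_ambiguity({"pick red cup": 0.9, "pick plate": 0.1}, threshold=0.1)
```

The threshold trades autonomy against safety: a larger value makes the robot ask more often, a smaller one makes it act on weaker evidence.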
As for corrective instruction, LILAC (Cui et al., 2023) utilizes GPT-3 to distinguish between task instruction and corrective instruction. It then employs Distil-RoBERTa to extract text features and feeds them into the network to modify the robot’s original trajectory. LATTE (Bucker et al., 2023), on the other hand, employs BERT and CLIP to extract features from the corrective instruction and observation images and feeds them into the network to modify the robot’s original trajectory. RT-H (Belkhale et al., 2024) employs VLMs in a two-step operation: it first outputs abstract delta-pose representations in language, such as “move left,” which are then converted into delta poses. Human intervention at the language level enables the robot to adjust its trajectory based on human language instruction.
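The two-step idea described for RT-H, in which a language-level motion is produced first and then turned into a delta pose, can be sketched as follows. The lookup table, the step size, and the axis convention are hypothetical stand-ins chosen for illustration; the source states only that the language motion is converted into a delta pose, not how.

```python
from typing import Dict, Optional, Tuple

Delta = Tuple[float, float, float]  # (dx, dy, dz) in meters, assumed convention

STEP = 0.02  # hypothetical step size per control cycle

# Hypothetical mapping from language motions to translational delta poses.
LANGUAGE_MOTIONS: Dict[str, Delta] = {
    "move left": (0.0, STEP, 0.0),
    "move right": (0.0, -STEP, 0.0),
    "move forward": (STEP, 0.0, 0.0),
    "move backward": (-STEP, 0.0, 0.0),
    "move up": (0.0, 0.0, STEP),
    "move down": (0.0, 0.0, -STEP),
}


def next_delta(predicted_motion: str, human_correction: Optional[str] = None) -> Delta:
    """Convert a language motion into a delta pose.

    If a human supplies a corrective language motion, it overrides the
    motion predicted by the model for this step.
    """
    motion = human_correction if human_correction is not None else predicted_motion
    if motion not in LANGUAGE_MOTIONS:
        raise ValueError(f"unknown language motion: {motion!r}")
    return LANGUAGE_MOTIONS[motion]


autonomous = next_delta("move left")
corrected = next_delta("move left", human_correction="move up")
```

Because the intermediate representation is plain language, a correction such as “move up” can be injected without the human having to specify joint or Cartesian commands.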
Summary
As shown in Figure 3, LLMs using chain of thought identify ambiguity efficiently, overcoming the limitations of enumerating ambiguities. The text comprehension of LLMs allows corrective instruction to be understood and the original trajectory to be transformed into a corrective trajectory.
Figure 3. Foundation models for interaction module. Interaction mainly involves the exchange of task instruction and corrective instruction. Ambiguity often arises in task instruction, hence the robot needs to detect ambiguities. (1) One approach is to perceive objects in a multi-modal environment and enumerate possible ambiguities based on perception information (Mo et al., 2023). (2) Another approach uses an LLM as the next-step prediction module, which predicts and scores the next step; if the difference between the scores of the top two steps is less than δ, the task goal is considered ambiguous (Ren et al., 2023a). (3) Strong comprehension skills are required during the transmission of corrective instruction, and the current mainstream approach uses the encoder of an LLM to extract tokens and feed them into the policy to modify the original trajectory (Bucker et al., 2023).
Pre- and post-conditions detection
Pre- and post-conditions detection identifies the initial and termination conditions of a skill. Pre-conditions detection recognizes objects and observes their affordances. Post-conditions detection identifies whether a task has been executed successfully and provides reasons for task failure after skill execution. Currently, few papers focus on identifying termination conditions. Cui et al. (2022) utilize CLIP to compare the target’s text or image with the termination environment to determine the success of task execution. We also found few articles that address the output of task failure reasons after skill execution. RobotGPT (Jin et al., 2024) analyzes task failure using the positions of manipulated objects after execution, but task failure should be determined during execution. AHA (Duan et al., 2024) uses a large number of robotic failure trajectories to fine-tune a VLM. The fine-tuned VLM leverages keyframe trajectory images and task descriptions from the robot’s current task execution to detect failures and provide detailed, adaptable failure explanations. Because work on post-conditions detection is limited, this section focuses on literature discussing foundation models in pre-conditions detection, including object affordance and object recognition.
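The CLIP-based success check described above amounts to comparing an embedding of the goal (text or image) with an embedding of the final observation. A minimal sketch follows, assuming both embeddings have already been produced by the corresponding CLIP encoders; the toy vectors, the function names, and the threshold of 0.8 are illustrative assumptions, not values from Cui et al. (2022).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def task_succeeded(
    goal_embedding: Sequence[float],
    final_observation_embedding: Sequence[float],
    threshold: float = 0.8,  # assumed value; would be tuned per task
) -> bool:
    """Declare success when the final scene is close enough to the goal."""
    return cosine_similarity(goal_embedding, final_observation_embedding) >= threshold


# Toy 3-dimensional embeddings standing in for real encoder outputs.
goal = [1.0, 0.0, 0.0]
success_case = task_succeeded(goal, [0.9, 0.1, 0.0])
failure_case = task_succeeded(goal, [0.0, 1.0, 0.0])
```

A check of this kind reports only whether the task succeeded; it does not explain why a failure occurred, which is the gap that failure-reasoning approaches such as AHA target.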
Object affordance
The affordances associated with an object represent the range of manipulations that the object affords the robot (Gibson, 2014). Early approaches addressed the issue by treating it as a supervised task (Kokic et al., 2017). However, the process of annotating datasets is laborious and time-consuming, making it impractical to exhaustively cover all geometric information present in real-world environments. Consequently, researchers are exploring the application of reinforcement learning, enabling robots to collect data and train affordance perception modules through continuous exploration (Wu et al., 2021). Nevertheless, current reinforcement learning methods are trained in simulated environments, leading to a significant sim-to-real gap. To address these challenges, researchers propose training the affordance perception module using videos of human interactions in real-world scenarios (Bahl et al., 2023; Ye et al., 2023b).
For supervised learning methods, GraspGPT (Tang et al., 2023) uses an LLM to generate object class descriptions and task descriptions. Object class descriptions detail the geometric shapes of each part of an object, while task descriptions outline the desired affordances for task execution, such as the types of manipulation actions to be taken. Integrating both components into the task-oriented grasp evaluator enhances the quality of the generated grasp pose. 3DAP (Nguyen et al., 2023) utilizes the text encoder of an LLM for feature extraction. The features extracted from the desired affordance text are fed into both the affordance detection module and the pose generation module, which enhances the quality of the predicted affordance map and the generated pose.
In reinforcement learning, ATLA (Ren et al., 2023b) utilizes GPT-3 to generate language descriptions of tools. These descriptions are then fed into a pre-trained BERT model to obtain representations, and the extracted features are finally fed into the SAC network module. Meta-learning techniques are employed to improve learning efficiency for new tools. Xu et al. (2023a) employ CLIP’s text and image encoders to extract features from the language instruction and scene image, improving the quality of grasp pose generation in the SAC module.
The methods mentioned above utilize foundation models to assist other learning methods in improving affordance maps or grasp poses. There are also direct approaches that use foundation models to generate affordance maps and grasp poses. PartSLIP (Liu et al., 2023d) renders 3D point clouds into multi-view 2D images and feeds these images, together with textual descriptions of object parts, into GLIP for object part detection, ultimately fusing the 2D bounding boxes into a 3D segmentation to generate affordance maps. However, PartSLIP requires manually defined text prompts and additional algorithms to convert 2D boxes back to 3D regions. UAD (Tang et al., 2025) clusters object points into fine-grained semantic regions based on pixel-wise features extracted from multi-view rendered images using DINOv2. It then queries a VLM to generate a set of task instructions and associates these instructions with the most relevant clustered region to construct the affordance map. LAN-grasp (Mirjalili et al., 2023) feeds the human instruction into an LLM, utilizing its prior knowledge to output the shape of the part to be grasped. This shape description, along with the object’s 2D image, is then fed into a VLM to detect the bounding box of the grasping part. Finally, the bounding box and the point cloud from the object’s 3D reconstruction are fed into the grasp planner to generate grasp poses.
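The step of fusing 2D bounding boxes from several views into a 3D part labeling can be illustrated with a simple voting scheme. This is a heavily simplified sketch and not the fusion algorithm of PartSLIP: the orthographic views, the box format, and the `min_votes` rule are all assumptions made here to show the general idea that a 3D point inherits a part label when enough views agree.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float, float]
Box = Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max)

# Hypothetical orthographic views: each keeps two of the three axes.
VIEWS: Dict[str, Tuple[int, int]] = {"top": (0, 1), "front": (0, 2), "side": (1, 2)}


def project(point: Point, view: str) -> Tuple[float, float]:
    """Project a 3D point into the 2D plane of the given view."""
    i, j = VIEWS[view]
    return point[i], point[j]


def lift_boxes_to_points(
    points: List[Point],
    detections: Dict[str, List[Tuple[str, Box]]],
    min_votes: int = 2,
) -> List[Optional[str]]:
    """Label each 3D point with the part name that enough views agree on."""
    labels: List[Optional[str]] = []
    for point in points:
        votes: Counter = Counter()
        for view, boxes in detections.items():
            u, v = project(point, view)
            for part, (u_min, v_min, u_max, v_max) in boxes:
                if u_min <= u <= u_max and v_min <= v <= v_max:
                    votes[part] += 1
        if votes:
            part, count = votes.most_common(1)[0]
            labels.append(part if count >= min_votes else None)
        else:
            labels.append(None)
    return labels


# Toy mug: one point on the handle, one on the body.
points = [(1.2, 0.0, 0.5), (0.0, 0.0, 0.5)]
detections = {
    "top": [("handle", (1.0, -0.2, 1.5, 0.2))],
    "front": [("handle", (1.0, 0.3, 1.5, 0.7))],
}
labels = lift_boxes_to_points(points, detections)
```

Requiring agreement across views is one way to suppress spurious 2D detections, at the cost of missing parts that are visible from only one viewpoint.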
Object recognition
Object recognition can be categorized into two types: passive perception and active perception. Unlike passive perception, active perception adjusts the viewpoint toward areas of interest (Kroemer et al., 2021). Modeling manipulation tasks and generalizing manipulation skills also necessitate representations of both the robot’s environment and the manipulated objects. These representations form the foundation for skill hierarchies, pre- and post-condition detection, skill learning, and transition model learning.
The Vision Transformers (ViTs) and similar attention-based neural networks have recently achieved state-of-the-art performance on numerous computer vision benchmarks (Han et al., 2022; Khan et al., 2022; Zhai et al., 2022) and the scaling of ViTs has driven breakthrough capability for vision models (Dehghani et al., 2023). The development of visual backbones not only advances pre-trained visual representations but also accelerates the progress of open-set perception tasks, such as segmentation and detection.
As for pre-trained visual representations, the algorithms mentioned have various training objectives: for instance, contrastive methods like Vi-PRoM (Caron et al., 2021), R3M (Nair et al., 2022), VIP (Ma et al., 2022), CLIP (Radford et al., 2021), and LIV (Ma et al., 2023a); distillation-based methods such as DINO (Caron et al., 2021); or masked autoencoder methods like MAP (Radosavovic et al., 2023) and MAE (He et al., 2022). The primary datasets utilized comprise the CLIP dataset (Radford et al., 2021), consisting of 400 million (image, text) pairs sourced from the internet, along with ImageNet (Deng et al., 2009), Ego4D (Grauman et al., 2022), and EgoNet (Jing et al., 2023).
Pre-trained visual representations transfer well to policy learning (Xiao et al., 2022b; Yang et al., 2023c), but visual representation involves not just recognizing spatial features but also understanding semantic features. Masked autoencoding methods prioritize low-level spatial aspects at the expense of high-level semantics, whereas contrastive learning methods do the opposite (Karamcheti et al., 2023). Voltron (Karamcheti et al., 2023) and iBOT (Zhou et al., 2021) both combine masked autoencoding with contrastive learning, with loss functions that balance the two aspects. To compare different pre-trained visual representations, CORTEXBENCH (Majumdar et al., 2023) and EmbCLIP (Khandelwal et al., 2022) establish benchmarks that assess which model could provide a better “artificial visual cortex” for manipulation tasks. However, the models included in these benchmarks are still not comprehensive enough.
The aforementioned pre-trained visual representations mainly involve the extraction of features from 2D images. The experience of learning representations on 2D images can also be extended to other modalities. For the object point cloud modality, ULIP (Xue et al., 2023a) and ULIP2 (Xue et al., 2023b) employ contrastive learning to align features between point clouds and text-images. Point-BERT (Yu et al., 2022) uses the masked autoencoding method to learn point cloud features by reconstructing point clouds. GeDi (Poiesi and Boscaini, 2022) uses a contrastive learning approach to extract general and distinctive 3D local geometric information. In the haptic modality, MOSAIC (Tatiya et al., 2023) utilizes contrastive learning to train the haptic encoder.
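To make the notion of contrastive alignment between modalities concrete, the sketch below computes a generic InfoNCE-style loss over a batch of paired embeddings, for example point-cloud features and the features of their matching text. It is a textbook formulation written in plain Python for readability, not the exact objective of any method cited above; the temperature of 0.07 and the toy embeddings are assumptions.

```python
import math
from typing import List, Sequence


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def info_nce(
    anchors: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value
) -> float:
    """Mean InfoNCE loss where positives[i] is the match for anchors[i].

    Every other entry in `positives` acts as a negative for anchors[i].
    """
    if len(anchors) != len(positives):
        raise ValueError("anchors and positives must be paired")
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature for other in p]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Toy 2D embeddings: point-cloud features and their paired text features.
point_cloud_features = [[1.0, 0.0], [0.0, 1.0]]
aligned_text_features = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text_features = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce(point_cloud_features, aligned_text_features)
shuffled_loss = info_nce(point_cloud_features, shuffled_text_features)
```

The loss is near zero when each anchor is most similar to its own pair and grows when pairs are mismatched, which is the signal that drives the encoders toward a shared embedding space.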
As for segmentation, SAM (Kirillov et al., 2023) develops a transformer-based architecture and creates the largest segmentation dataset, with over one billion masks from 11 million images. The model is adaptable and enables zero-shot transfer to new tasks and image distributions. Fast-SAM (Zhao et al., 2023b) and Faster-SAM (Zhang et al., 2023a) aim to improve the training and inference speed of the network by improving its structure. TAM (Yang et al., 2023a) merges SAM (Kirillov et al., 2023) and XMem (Cheng and Schwing, 2022) for high-performance interactive tracking and segmentation in videos.
As for detection, traditional detection models are usually confined to a narrow range of semantic categories because of the cost and time involved in gathering localized training data within extensive or open-label domains. However, advancements in language encoders and contrastive image-text training enable open-set detection. Researchers integrate language into a closed-set detector to generalize open-set concepts, detecting various classes through language generalization despite being trained solely on existing bounding box annotations, such as OWL-ViT (Minderer et al., 2022), Grounding-DINO (Liu et al., 2023f), OVD (Zareian et al., 2021), ViLD (Gu et al., 2021), and DetCLIP (Yao et al., 2022a).
Deploying such models in open-set detection presents a significant challenge, primarily because even slight alterations in prompting can greatly impact performance. Fine-tuning can enhance a foundation model’s understanding of prompting. However, foundation models are often over-parameterized, leading to slow training processes. COOP (Zhou et al., 2022) maps prompting to a set of learnable vectors, which can be optimized through network training. In CLIP-Adapter (Gao et al., 2024), two extra linear layers are appended after the final layer of either the vision or language backbone to enable efficient few-shot transfer learning through fine-tuning.
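The adapter idea can be sketched as two small linear layers applied to a frozen backbone feature. In the sketch below, the residual blending with a `ratio` parameter and the placement of the ReLU are assumptions added for illustration; the text above states only that two extra linear layers are appended after the backbone, and the weights shown are toy values rather than trained parameters.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def linear(weights: Matrix, bias: Sequence[float], x: Sequence[float]) -> List[float]:
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Sequence[float]) -> List[float]:
    return [max(0.0, value) for value in x]


def adapt(
    feature: Sequence[float],
    w1: Matrix, b1: Sequence[float],
    w2: Matrix, b2: Sequence[float],
    ratio: float = 0.2,  # assumed blending weight between adapted and original feature
) -> List[float]:
    """Pass a frozen backbone feature through two trainable linear layers.

    Only w1, b1, w2, b2 would be updated during few-shot fine-tuning;
    the backbone that produced `feature` stays frozen.
    """
    hidden = relu(linear(w1, b1, feature))
    adapted = linear(w2, b2, hidden)
    return [ratio * a + (1.0 - ratio) * f for a, f in zip(adapted, feature)]


# Toy example with a 2-dimensional feature.
feature = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
doubling = [[2.0, 0.0], [0.0, 2.0]]
zeros = [0.0, 0.0]

unchanged = adapt(feature, identity, zeros, doubling, zeros, ratio=0.0)
blended = adapt(feature, identity, zeros, doubling, zeros, ratio=0.5)
```

Because only the two small layers are trained, the number of updated parameters is tiny compared with the backbone, which is what makes few-shot fine-tuning fast.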
Methods for open-set detection on 2D images can be extended to open-set detection on 3D point clouds. PointCLIP (Zhang et al., 2022b) utilizes a pre-trained CLIP to extract features from multi-view depth images of the point cloud, then compares the extracted features with textual features to identify the point cloud category.
Summary
As shown in Figure 4, LLM provides object part-level knowledge via text, aiding in affordance map or grasp pose generation. Reinforcement learning can make robotic systems perform better though interaction than supervised learning trained on datasets. Direct use of foundation models avoids training. However, stability remains a concern. In object recognition, representation learning aligns multimodal features with text, improving model cognition, similar to human think with words. It also supports open-set perception tasks, like detection and segmentation. Foundation models for pre-conditions detection. As for object affordance, the main approaches of task-oriented grasp are supervised learning and reinforcement learning. Both methods utilize LLM to generate object part-level description and desired affordance description in task instruction, then fuse tokens and features into the original network through language encoder and image encoder to output task-oriented grasp pose (Ren et al., 2023b; Tang et al., 2023). In reinforcement learning, it is possible to choose between a LLM language encoder with a custom-designed image encoder, or a VLM language encoder with a VLM image encoder. When selecting the LLM language encoder with a custom image encoder, the LLM language encoder should be frozen, and the custom image encoder should be trained (Ren et al., 2023b). When using the VLM language encoder with the VLM image encoder, both encoders are typically frozen (Xu et al., 2023a). Direct using foundation method utilizes LLM to generate object part-level description and desired affordance description according to task instruction. VLM marks out the part of the object to grasp in the image based on the description (Liu et al., 2023d). 
As for object recognition, the representation learning methods in state perception mainly include contrastive learning (Radford et al., 2021), distillation-based learning (Caron et al., 2021), and masked autoencoder learning (Radosavovic et al., 2023). Masked autoencoding methods prioritize low-level spatial aspects, sacrificing high-level semantics, whereas contrastive learning methods focus on the inverse. The fusion of masked autoencoder and contrastive learning is employed in both Voltron (Karamcheti et al., 2023) and iBOT (Zhou et al., 2021). Multimodal representation learning focuses primarily on multimodal alignment (Tatiya et al., 2023; Xue et al., 2023b). Training the encoder with large-scale data and parameters has facilitated open-set perception, including tasks such as open-set detection and open-set segmentation. For instance, SAM (Kirillov et al., 2023) utilizes MAE (He et al., 2022), and ViLD (Gu et al., 2021) employs CLIP (Radford et al., 2021).
Hierarchy of skills
The skill hierarchy is closely related to the field of task and motion planning (TAMP). TAMP aims to address high-level instructions by organizing tasks in a sequence that ensures dynamic feasibility (Guo et al., 2023). There are three main types of classical TAMP methods: constraint-based TAMP, sampling-based TAMP, and optimization-based TAMP (Zhao et al., 2024). Constraint-based and sampling-based TAMP define the problem with goal conditions. Unlike optimization-based TAMP, these approaches often cannot assess or compare the quality of the generated plan or final state due to the lack of objective functions, such as when the goal is to pour as much water as possible into the cup (Zhang et al., 2022a; Zhao et al., 2024). However, optimization-based TAMP is sensitive to the initial conditions and goal setup of the problem (Zhao et al., 2024).
The scalability of classical TAMP methods is often constrained by the tree search problem size for complex tasks and the computational cost of evaluating heuristics and optimal trajectories (Zhao et al., 2024). Integrating learning-based approaches into TAMP enables informed decision-making based on prior examples and experiences and improves flexibility and generalizability (Guo et al., 2023). Models for skill hierarchy can be trained using text or videos, similar to how humans learn assembly procedures from instructional manuals or tutorial videos. As for tutorial videos, VLaMP (Patel et al., 2023) and SeeDo (Wang et al., 2024a) use trained models to understand human operations in videos, and HourVideo (Chandrasegaran et al., 2024) proposes a benchmark dataset specifically designed for hour-long video-language understanding.
Traditional TAMP’s domain representations are usually manually specified by expert users in formal languages such as PDDL (Silver et al., 2022). However, LLMs have been explored for processing and interpreting natural language inputs (Huang et al., 2022). They offer a novel approach to encoding the planning domain in a more intuitive and accessible way. Furthermore, LLMs’ acquisition of world knowledge and commonsense reasoning has the potential to improve the scalability and generalizability of skill hierarchy tasks (Driess et al., 2023; Jansen, 2020; Vemprala et al., 2023). Various benchmarks such as PlanBench (Valmeekam et al., 2023) can assess the planning and reasoning capability of LLMs.
LLMs possess a notable limitation: they lack practical experience, which hinders their utility for decision-making within a specific context, so the output of LLMs often cannot be translated into executable actions for the robot. Huang et al. (2022) first use a pre-trained causal LLM to break down high-level tasks into logical mid-level action plans. Then, a pre-trained masked LLM is employed to convert mid-level action plans into admissible actions. However, prompts usually require the context of the robot’s capability, its current state, and the environment. At the same time, LLMs are considered “forgetful” and do not treat information in the system prompt as absolute. Despite efforts to reinforce task constraints in the objective prompt and to extract numerical task contexts from the system prompt and store them in data structures, errors caused by LLM forgetfulness remain unresolved (Chen and Huang, 2023).
To address the aforementioned issues, SayCan (Ahn et al., 2022) scores pre-trained tasks based on prompting and observation images, generating the task sequence with the highest score. SayCan provides a paradigm for generating action sequences using an LLM, but there are still some drawbacks: (1) The generated action sequences do not incorporate user preferences. (2) Safety regulations are not adequately addressed. (3) The skill library is limited. (4) The LLM focuses solely on reasoning when generating action sequences, neglecting feedback on action execution. (5) Scene grounding is limited. GD (Huang et al., 2023d) proposes a paradigm to address the aforementioned issues by not only scoring the generated action sequence using an LLM but also introducing a grounded function model for scoring the generated action sequence. The grounded function model encompasses token-conditioned robotic functions, such as affordance functions that capture the abilities of a robot based on its embodiment, safety functions, and more. This approach tackles the drawbacks by designing grounded functions, avoiding fine-tuning of the LLM.
Regarding user preferences, TidyBot (Wu et al., 2023b) trains an LLM by collecting users’ preference data, enabling the trained LLM to choose behaviors that better align with user preferences. As for safety regulations, Yang et al. (2023d) incorporate ISO 61508, a global standard for safely deploying robots in industrial factory settings, into the constraints of the action sequence generation. As for the skill library, BOSS (Zhang et al., 2023c) suggests using LLMs’ rich knowledge to guide skill chaining in the skill library, aiming to create new skills through combinations. RoboGen (Wang et al., 2023e) employs generative models to create new skill task scenarios, then utilizes either reinforcement learning or gradient optimization methods to automatically learn new skills based on the reward function generated by the LLM. As for action execution feedback, REACT (Yao et al., 2022b), COWP (Ding et al., 2022), LLM-Planner (Song et al., 2023), CoPAL (Joublin et al., 2023), and PROGPROMPT (Singh et al., 2023) provide feedback on robot action execution to LLMs. This allows LLMs to adjust action sequences based on execution status, creating a closed-loop process for generating action sequences.
As for the limitation of scene grounding, LLMs need to query the scene representation to determine the availability, relationship, and location of objects. NLMap (Chen et al., 2023a) proposes an open-vocabulary, queryable semantic representation map built on ViLD and CLIP. This map outputs the poses of related objects based on the task instruction, which are then handed over to the LLM for planning. Text2Motion (Lin et al., 2023) incorporates a geometric value function on top of the value function, enabling the robot to select actions that adhere to geometric constraints based on scene descriptions. Xu et al. (2023b) explore the possibility of teaching robots to creatively utilize tools within scenarios that involve implicit physical limitations and require long-term planning. VILA (Hu et al., 2023a) incorporates perceptual data into GPT-4V for its reasoning and planning processes, facilitating comprehension of commonsense knowledge within the visual domain, encompassing spatial arrangements and object characteristics. PHYSOBJECTS (Gao et al., 2023) fine-tunes a VLM to enhance its understanding of physical object attributes, such as material. Integrating this physically informed VLM into an interactive framework with an LLM enhances task planning performance in tasks whose instructions relate to physical object attributes. SpatialVLM (Chen et al., 2024b) and 3D-LLM (Hong et al., 2023) utilize a 2D pre-trained VLM to train on collected 3D datasets, enhancing capabilities related to 3D tasks while maintaining the abilities of previous tasks.
The hierarchy of skills possessed by LLMs or VLMs can be applied not only to a single agent but also to multiple agents. SMART-LLM (Kannan et al., 2023) utilizes an LLM for the hierarchy of skills and allocates the resulting tasks to the agents through the task assignment module.
Regardless of whether the prompting input to LLMs is in natural language or PDDL format, the hierarchy of skills possessed by LLMs still exhibits instability (Silver et al., 2022). Hence, researchers are exploring approaches that integrate LLMs with classical PDDL-based planning methods for the hierarchy of skills. LLM + P (Liu et al., 2023a) utilizes LLMs to translate natural language into PDDL and feeds the result into a classical planner for the hierarchy of skills. Xie et al. (2023b) indicate that LLMs exhibit greater efficacy in translation tasks than in planning.
Summary
According to Figure 5, the hierarchy of skills is mainly divided into video instruction and language instruction. VFM and LLM play roles in perception and reasoning. Language instruction is further divided into methods based on foundation models and methods combining foundation models with classical TAMP. As shown in Appendix C Table 6, there is currently no significant research comparing video instruction and language instruction. However, from the modality perspective, video provides more temporal or spatial dependencies regarding tasks compared to language. This also means that video instruction places higher demands on the hierarchy of skills, which must not only output task plans but also understand the task and scene constraints from the video. Language instruction is more suitable for interaction and reasoning with LLMs/VLMs. However, the two share some similarities. Current research on the hierarchy of skills in language and video instruction tends to focus on SOTA VLMs, and both have similar failure modes, indicating that both face challenges in perception and reasoning. In the language instruction methods, the combination of foundation models and classical TAMP is more reliable than foundation models alone, but it also faces limitations in generalization. Therefore, how to better integrate foundation models with classical TAMP requires further research.
Figure 5. Foundation models for hierarchy of skills. (1) Utilize human operation videos to learn the skill sequence for task execution: decompose the video of the user’s progress so far into observations and human actions through segmentation, and input them along with the task instruction into a pre-trained language model to predict the next step (Patel et al., 2023). (2) The LLM scores the skills in the skill library based on the task instruction and the skills already executed, and the value function also scores the skills in the skill library based on observation images. The highest-scoring skill, obtained by multiplying the two scores, is selected as the next step (Ahn et al., 2022). The value function can consider multiple factors such as affordance, safety, user preference, and more (Huang et al., 2023d), and these considerations can also be incorporated by fine-tuning the LLM (Wu et al., 2023b). (3) The LLM assists the classical planner by translating the task instruction into PDDL descriptions, sending them to the classical planner to generate a PDDL plan, and then translating the PDDL plan into a natural language plan using the LLM (Liu et al., 2023a).
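The skill-selection rule in step (2) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `select_next_skill`, the skill names, and the numeric scores are hypothetical, and in a real system the two score sets would come from an LLM and a learned value function rather than hand-written dictionaries.

```python
def select_next_skill(llm_scores, value_scores):
    """Combine per-skill scores by multiplication and return the best skill.

    llm_scores:   dict mapping skill name to the language-model score
                  (how useful the skill is for the instruction).
    value_scores: dict mapping skill name to the value-function score
                  (how feasible the skill is in the current observation).
    Only skills present in both dictionaries are considered.
    """
    combined = {
        skill: llm_scores[skill] * value_scores[skill]
        for skill in llm_scores
        if skill in value_scores
    }
    if not combined:
        raise ValueError("no skill has both an LLM score and a value score")
    best = max(combined, key=combined.get)
    return best, combined

# Hypothetical scores: the LLM prefers picking up the sponge, but the value
# function reports that the robot is not yet close enough to grasp it.
llm_scores = {"pick up sponge": 0.6, "go to table": 0.3, "pick up apple": 0.1}
value_scores = {"pick up sponge": 0.2, "go to table": 0.9, "pick up apple": 0.8}
best_skill, combined_scores = select_next_skill(llm_scores, value_scores)
```

Multiplying the scores means a skill must be both relevant and feasible to win: a skill with a high language score but a low value score is suppressed, which is how grounding in the scene enters the plan.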
State
The State module focuses on perceiving the environment, objects, and robot states. Section Pre- and post-conditions detection introduces low-level perception methods. This section explains high-level approaches for 3D reconstruction and pose estimation.
3D reconstruction
3D reconstruction involves capturing both the shape and appearance of an object or scene (Wikipedia Contributors, 2025). 3D reconstruction methods are divided into passive and active types (Butime et al., 2006). Active methods involve contact with the object or project some form of energy onto it, such as laser scanning (Butime et al., 2006) and X-ray (Maken and Gupta, 2023). These devices have high accuracy, but they are usually expensive. Therefore, various studies focus on 3D reconstruction using consumer RGBD cameras, such as Microsoft Kinect, Intel RealSense, Google Tango, and ORBBEC Gemini (Li et al., 2022).
These consumer cameras typically use principles such as structured light, time of flight, and traditional photometric stereo for depth estimation (Zhou et al., 2024a), and a 3D representation can be generated by registering the resulting depth images using camera poses (Huang et al., 2024b). However, when surfaces are shiny, bright, transparent, textureless, or distant from the camera, depth images produced by consumer cameras are often noisy and incomplete. Several studies have addressed this challenging problem by learning to restore depth images (Dai et al., 2022; Fang et al., 2022; Sajjan et al., 2020). However, the correct depth information may already be lost in the original depth image. ASGrasp (Shi et al., 2024) demonstrates that 3D reconstruction using raw multi-view images from consumer cameras is better than restoring the original depth. Many studies use a single image for 3D reconstruction (Fu et al., 2021). However, a single image loses a significant amount of information, resulting in lower accuracy. Despite this, the zero-shot capability of current single-image 3D reconstruction has led to its widespread application in simulation scene generation (see Subsection Scene and demonstration generation).
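Registering depth images with camera poses can be illustrated with a pinhole back-projection. The sketch below is an assumption-laden example rather than the method of any cited work: the function name `backproject_depth`, the pinhole intrinsics `fx, fy, cx, cy`, and the convention that a depth of zero marks a missing measurement are all choices made for illustration.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy, cam_to_world=None):
    """Convert a depth image into a 3D point cloud with a pinhole camera model.

    depth:        (H, W) array of depths along the optical axis; values <= 0
                  are treated as missing and skipped (assumed convention).
    fx, fy:       focal lengths in pixels; cx, cy: principal point in pixels.
    cam_to_world: optional 4x4 camera pose used to register the points
                  into a shared world frame.
    Returns an (N, 3) array of points.
    """
    depth = np.asarray(depth, dtype=float)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=1)
    if cam_to_world is not None:
        pose = np.asarray(cam_to_world, dtype=float)
        points = points @ pose[:3, :3].T + pose[:3, 3]
    return points

# Toy 2x2 depth image with one missing pixel, and a pose that shifts x by 1.
depth_image = np.array([[1.0, 0.0],
                        [2.0, 1.0]])
pose = np.eye(4)
pose[0, 3] = 1.0
cloud_cam = backproject_depth(depth_image, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
cloud_world = backproject_depth(depth_image, 1.0, 1.0, 0.0, 0.0, cam_to_world=pose)
```

Point clouds from several views can then be concatenated in the world frame. Missing pixels simply produce no points, which is why noisy or incomplete depth on shiny or transparent surfaces leaves holes in the reconstruction.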
The 3D representation for 3D reconstruction can be either explicit or implicit (Zhou et al., 2024a). Explicit expressions include point clouds (Shi et al., 2024), voxels (Jiang et al., 2021), and meshes (Wen et al., 2019). The three representations can be converted into each other (Jiang et al., 2021), but each has its own advantages. A point cloud is made up of discrete points in space, providing flexibility in processing. A voxel can store spatial information inside an object but comes with high space complexity. A mesh uses triangle meshes to represent complex shapes and details accurately, such as deformation (Wen et al., 2019). It ensures the projection is always convex, making it easier to rasterize (Zhou et al., 2024a). Implicit expressions represent 3D geometry using a function, such as a signed distance function (SDF) (Chabra et al., 2020), an occupancy field (Jiang et al., 2021), or a radiance field (Mildenhall et al., 2021). They offer differentiability and efficient storage, making them a powerful tool. In contrast, explicit expressions tend to be more intuitive.
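The difference between an implicit and an explicit representation can be made concrete with an analytic example. The sketch below assumes a sphere as the shape and uses hypothetical helper names (`sphere_sdf`, `occupancy_from_sdf`); learned SDFs and occupancy fields replace the closed-form function with a neural network but are queried in the same way.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    points = np.asarray(points, dtype=float)
    center = np.asarray(center, dtype=float)
    return np.linalg.norm(points - center, axis=-1) - radius

def occupancy_from_sdf(sdf_values):
    """Derive a binary occupancy value from signed distances (1 = inside or on the surface)."""
    return (np.asarray(sdf_values) <= 0).astype(np.uint8)

# Query the implicit function at arbitrary continuous locations.
queries = np.array([[0.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [2.0, 0.0, 0.0]])
distances = sphere_sdf(queries, center=[0.0, 0.0, 0.0], radius=1.0)
occupied = occupancy_from_sdf(distances)

# Converting to an explicit voxel grid requires choosing a resolution up front.
axis = np.linspace(-1.5, 1.5, 4)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
voxels = occupancy_from_sdf(sphere_sdf(grid, [0.0, 0.0, 0.0], 1.0))
```

The implicit form stores only the function and can be evaluated at any point, whereas the voxel grid fixes its resolution when it is created and grows cubically in memory as that resolution increases.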
GIGA (Jiang et al., 2021) points out that manipulation requires a fine-grained understanding of local geometry details. Implicit representations, due to their continuous and differentiable nature, can represent smooth surfaces at high resolution. As a result, there is growing research using implicit representations for manipulation tasks (Dai et al., 2023; Lu et al., 2024). Current state-of-the-art methods for representing scenes using implicit representations are mainly divided into NeRF-based (Wang et al., 2023b) and 3DGS-based approaches (Kulhanek and Sattler, 2024). Compared to NeRF (Mildenhall et al., 2021), 3DGS offers better real-time performance (Kerbl et al., 2023). However, these implicit 3D representations currently lack scene semantics and are not easily editable for 3D modifications (Bai et al., 2024).
As for scene semantics, semantic-NeRF (Zhi et al., 2021) employs manually annotated semantic labels to jointly encode semantics, appearance, and geometry using NeRF. Manual annotation is time-consuming and labor-intensive. Due to the foundation model’s robust open-set capability for objects, DFF (Shen et al., 2023), CLIP-Fields (Shafiullah et al., 2022), and LERF (Kerr et al., 2023) employ CLIP image encoder to extract features from multi-view 2D images for NeRF (Mildenhall et al., 2021) reconstruction. These features are integrated as part of the output of the NeRF network, enriching the semantic information of the reconstructed 3D scenes. When a text prompt is provided, the features output by the CLIP text encoder can be compared with the CLIP image features output by NeRF to form a relevancy map. This relevancy map can support downstream tasks, such as semantic scene completion and object localization (Ha and Song, 2022). Since CLIP can only provide image-level features, the relevancy map lacks precise pixel-level object boundary information. 3DOVS (Liu et al., 2023c) incorporates DINO features into the NeRF output to distill object boundary information. OV-NeRF (Liao et al., 2024) addresses the issues of coarse relevancy maps and view-inconsistent relevancy maps through SAM and cross-view self-enhancement. FMGS (Zuo et al., 2025) transfers this concept from NeRF to 3DGS, achieving 851X faster inference. Although foundation models, such as CLIP and DINO, enable 3D open-set semantic scene understanding, the performance is limited by the foundation models themselves. For example, CLIP is constrained by the bag-of-words limitation (Kerr et al., 2023).
The image features output by NeRF can be used to build a relevancy map. They can also be lifted into 3D space through multi-view images, serving as 3D features for downstream tasks (Ze et al., 2023). 3D-LLM (Hong et al., 2023) extracts 2D features from multi-view rendered images using the CLIP image encoder. These features are then fused into 3D features through Direct Reconstruct, gradSLAM (Jatavallabhula et al., 2023), or Neural Field methods (Hong et al., 2023), endowing 3D features with semantic information.
For implicit 3D editing, some methods use human scribbles to edit 3D shape and appearance (Li and Pan, 2024; Liu et al., 2021; Schwarz et al., 2020; Zhang et al., 2023d). However, this approach is not intuitive enough. With the development of foundation models, many methods for implicit 3D editing using text prompts have emerged. CLIP-NeRF (Wang et al., 2022a) integrates semantic features extracted by CLIP into NeRF reconstruction to change object shape and appearance during rendering. However, CLIP-based approaches cannot precisely modify specific local regions. Instruct-NeRF2NeRF (Haque et al., 2023) utilizes InstructPix2Pix (Brooks et al., 2023) to iteratively edit multiview input images and optimize the underlying scene in NeRF. This process produces a refined 3D scene that adheres to the edit instruction. However, InstructPix2Pix modifies the entire image. As a result, regions that are not desired may also be altered. DreamEditor (Zhuang et al., 2023) uses Dreambooth (Ruiz et al., 2023) to generate 2D editing masks. These masks are then converted into 3D editing regions through back projection. This approach enables precise local editing.
Pose estimation
Object pose estimation can be divided into marker-based and markerless methods (Karashchuk et al., 2021). Marker-based methods require attaching passive or active markers (Cassinis and Tampalini, 2007) to the object. These methods achieve high accuracy in pose estimation. For example, the NDI Polaris Vega XT, commonly used in medical robotics, can achieve an accuracy of 0.12 mm RMS (NDI, 2024). However, in unstructured environments, it is not feasible to attach specific markers to every object. Therefore, achieving object pose estimation in unstructured, markerless environments is necessary.
From the perspective of generalization, pose estimation methods can be classified into instance-level, category-level, and unseen object approaches (Liu et al., 2024c). Instance-level methods can estimate pose accurately for specific object instances on which they are trained. However, they struggle with novel objects. To improve the model’s adaptation for pose estimation of novel objects, category-level approaches use geometric priors from objects of the same category to estimate the pose of the novel object without requiring its 3D model (Wang et al., 2019). Unseen object approaches typically rely on the 3D model of the novel object to estimate its pose (Caraffa et al., 2024).
Category-level and unseen object approaches can also be classified into model-free and model-based approaches (Liu et al., 2024c). Model-free methods do not require prior knowledge of the object’s 3D model. These methods typically regress the object pose using a neural network (Guan et al., 2024). However, these methods require large amounts of data for the neural network to learn the geometric priors of the object. In contrast, model-based methods use a known 3D model of the object and usually lead the BOP benchmark for object pose estimation (Burde et al., 2024). However, obtaining accurate object 3D models in the real world is not easy. The advancement of multi-view image 3D reconstruction technology bridges the gap between model-based and model-free real-world applications (Burde et al., 2024).
The test input modalities for the model-based approach include RGB, depth, and RGBD. Currently, the RGBD modality leads the BOP benchmark for object pose estimation. The optimization goals are primarily divided into three parts: 2D–2D correspondences followed by regression (Nguyen et al., 2024), 2D–3D correspondences followed by PnP (Ausserlechner et al., 2024; Li et al., 2023b), and 3D–3D correspondences followed by least squares fitting (Caraffa et al., 2024; Lin et al., 2024b). However, pose estimation accuracy remains a challenge when dealing with occlusion, specularity, symmetry, and textureless objects (Guan et al., 2024). Many methods use the predicted pose as a coarse result and refine it to obtain a fine result (Labbé et al., 2022; Moon et al., 2024; Wen et al., 2024a).
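The last of these optimization goals, least squares fitting over 3D–3D correspondences, has a closed-form solution based on the singular value decomposition (commonly known as the Kabsch algorithm). The sketch below is a generic illustration, not the implementation of any cited method; the function name `fit_rigid_transform` and the toy points are assumptions, and real pipelines usually wrap this step in an outlier-rejection scheme such as RANSAC.

```python
import numpy as np

def fit_rigid_transform(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ src @ R.T + t.

    src, dst: (N, 3) arrays of corresponding 3D points, N >= 3 and not collinear.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_centroid = src.mean(axis=0)
    dst_centroid = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    h = (src - src_centroid).T @ (dst - dst_centroid)
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = dst_centroid - rotation @ src_centroid
    return rotation, translation

# Toy check: recover a 90-degree rotation about z plus a translation.
model_points = np.array([[0.0, 0.0, 0.0],
                         [1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])
true_rotation = np.array([[0.0, -1.0, 0.0],
                          [1.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0]])
true_translation = np.array([1.0, 2.0, 3.0])
observed_points = model_points @ true_rotation.T + true_translation
est_rotation, est_translation = fit_rigid_transform(model_points, observed_points)
```

With noise-free correspondences the estimate matches the true pose up to numerical precision. With noisy or partially wrong correspondences it returns the best fit in the least-squares sense, which is why it is often used as a coarse result and then refined.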
The pose estimation of moving objects mainly involves two methods. 1) Some single-image 6D pose estimation methods are fast and re-estimate poses from scratch for each frame. However, this approach is inefficient and results in less coherent estimations (Wen and Bekris, 2021). 2) Pose tracking utilizes temporal cues to improve pose prediction. It enhances efficiency, smoothness, and accuracy in video sequences. Current pose-tracking methods are mainly divided into probabilistic tracking (Deng et al., 2021; Issac et al., 2016; Stoiber et al., 2022) and optimization-based tracking (Li et al., 2018; Lin et al., 2022; Wang et al., 2020a; Wen et al., 2020). Pose tracking faces challenges mainly from motion blur, incremental error drift, and occlusion. To address these issues, BundleSDF (Wen et al., 2023a) and BundleTrack (Wen and Bekris, 2021) use an online pose graph optimization process.
Some research integrates foundation models into pose estimation. As for category-level methods, OV9D (Cai et al., 2024) utilizes DINO and VQVAE to extract visual features from images, while CLIP is used to extract text features from category prompts. These features are then fed into the Stable Diffusion UNet (Rombach et al., 2022) to generate a normalized object coordinate space (NOCS) map. This method achieves generalizability to unseen categories and enables open-set pose estimation. In unseen object pose estimation with foundation models, FoundationPose (Wen et al., 2024a) utilizes LLM-aided synthetic data generation at scale to ensure strong generalizability for novel object pose estimation and tracking. SAM-6D (Lin et al., 2024b) and ZS6D (Ausserlechner et al., 2024) leverage SAM to generate valid proposals, enabling zero-shot 6D pose estimation. FreeZe (Caraffa et al., 2024) employs frozen GeDi (Poiesi and Boscaini, 2022) and DINO (Caron et al., 2021) to extract both geometric and visual features from the query object model and the target object’s RGBD observation image. It then uses 3D–3D fused feature correspondences to obtain the 6D pose. Due to the foundation models’ robust capability in discriminative feature extraction, FreeZe achieves state-of-the-art results without the need for any data or training. Overall, foundation models primarily improve generalization for novel object pose estimation in three areas: data, object recognition, and feature extraction. However, the performance is limited by the foundation models themselves. For example, foundation models are large in size (Caraffa et al., 2024) and SAM may hallucinate in object segmentation (Kirillov et al., 2023).
Summary
Following Figure 6, VLM and VFM assist implicit 3D reconstruction by generating relevancy maps that include semantic information (Kerr et al., 2023). They can also be employed in 2D-to-3D lifting to extract 3D features, encompassing texture, semantic, and spatial information (Hong et al., 2023). VGM aids in generating edited images and modifying 3D scenes based on these edited images (Haque et al., 2023). FreeZe (Caraffa et al., 2024) achieves state-of-the-art results in pose estimation by extracting discriminative features through 2D-to-3D lifting and LMM.
Figure 6. Foundation models for state. The foundation models have three main applications in 3D reconstruction: 3D open-set semantic scene understanding, lifting 2D features to 3D space, and implicit 3D editing. In 3D open-set semantic scene understanding, the main pipeline is to use image features extracted by the VFM encoder and VLM image encoder as input for NeRF. Then, semantic text features extracted by the VLM language encoder are used in conjunction with the image features from NeRF to generate a relevancy map through a relevancy extractor (Kerr et al., 2023). This relevancy map can support downstream tasks, such as semantic scene completion and object localization (Ha and Song, 2022). As for lifting, using the VLM image encoder to extract features from multi-view images of the 3D data and lifting them into 3D features can incorporate semantic information into the 3D features. The lifting methods include direct reconstruction, gradSLAM, and Neural Field (Hong et al., 2023). For implicit 3D editing, the current mainstream pipeline is to input the image rendered by NeRF and the editing prompt into the VGM to generate the updated image. The updated image is then fed back into NeRF for training, modifying NeRF’s radiance field representation of the 3D scene (Haque et al., 2023). Pose estimation with foundation models achieves state-of-the-art results (Caraffa et al., 2024). The main method is 2D-to-3D lifting. It extracts texture features from the object model and observation RGBD image. LMM extracts geometric features from the object model and observation RGBD image. The fused features are then used to estimate the 6D pose through 3D–3D correspondences.
Policy
The policy is divided into two categories: object/action-centric methods and end-to-end methods. Object/action-centric methods extract attributes from observations, such as bounding boxes, masks (Sajjan et al., 2020), or 3D spatial action-value map (Shi et al., 2024). These extracted attributes are then transformed into either a sequence of key poses or a single key pose, which is used in motion planning to guide robot motion. End-to-end methods directly map observation to robot action (Chi et al., 2023). They eliminate the need for attribute extraction.
End-to-end methods are mainly divided into reinforcement learning (Herzog et al., 2023) and imitation learning (Dasari et al., 2019). Recent end-to-end methods have made significant progress. ACT (Zhao et al., 2023a) uses action chunks to reduce compounding errors. Diffusion policy (Chi et al., 2023) applies the idea of diffusion to visuo-motor control, tackling challenges such as action multimodality and sequential correlation to handle high-dimensional action sequences.
However, the above methods are all one-model-for-one-task, lacking general-purpose capability. Due to the development of foundation models, general-purpose models have advanced. The representation of task instruction can be categorized into four types: language, human video, goal image, and multimodal prompts. BC-Z (Jang et al., 2022) and Vid2Robot (Jain et al., 2024) introduce video-conditioned policies that use human video as task instructions. DALL-E-Bot (Kapelyukh et al., 2023) employs DALL-E to generate target images for tasks and generates actions for manipulation by combining the target image with the observation image. VIMA (Jiang et al., 2023) and MIDAS (Li et al., 2023c) observe that many robot manipulation tasks can be represented as multimodal prompts intertwining language and image/video frames. They construct multimodal-prompt manipulation datasets and fine-tune pre-trained language foundation models to control robot outputs. MUTEX (Shah et al., 2023) extends instruction to various modalities and develops speech-conditioned, speech-goal-conditioned, image-goal-conditioned, and text-goal-conditioned policies.
Language-conditioned general-purpose policies remain the predominant paradigm in current research. RT-2 (Brohan et al., 2023) refers to this approach as
The above methods integrate foundation models into policy models to guide action generation. Currently, some approaches leverage foundation models to assist in reinforcement learning training.
VLAC
Code generation and program synthesis have been demonstrated to be capable of developing generalizable, interpretable policies (Trivedi et al., 2021). However, for a robot to generate code for multiple tasks, rich knowledge across various domains is essential (Ellis et al., 2023). Therefore, scholars aim to apply the prior knowledge of LLMs to code generation tasks (Austin et al., 2021; Chen et al., 2021b). Code-As-Policy (Liang et al., 2023) demonstrates the possibility of using LLMs to directly generate code for robot execution based on prompts. The study shows that (1) code-writing LLMs enable novel reasoning capability, such as encoding spatial relationships by leveraging familiarity with third-party libraries and (2) hierarchical code-writing inspired by recursive summarization improves code generation. From-text-to-motion (Yoshida et al., 2025) translates descriptions of human actions into humanoid robot motion code, enabling the robot to perform various tasks autonomously.
VLAKP
The utilization of foundation models to generate key poses can be categorized into two approaches: (1) Directly using existing foundation models to output key poses. (2) Training RFMs to generate key poses through imitation learning.
Utilizing foundation models pre-trained on existing large-scale internet datasets enables the direct perception of observation images and outputting key poses. Instruct2Act (Huang et al., 2023b) utilizes CLIP and SAM to identify manipulated objects within an observation image and outputs the 3D position of these manipulated objects from the 2D image. VoxPoser (Huang et al., 2023c) utilizes LLMs to generate code that interacts with VLMs to produce affordance maps and constraint maps, collectively referred to as value maps, grounded in the robot’s observation space. These composed value maps serve as objective functions for motion planners to synthesize trajectories for robot manipulation. ReKep (Huang et al., 2024a) uses VFM and VLM to extract relational keypoint constraints from language instructions and RGBD observations. It then applies an optimization solver to generate a series of end-effector poses.
As for imitation learning methods, CLIPort (Shridhar et al., 2021) demonstrates the capability of imitation learning in language-conditioned general manipulation. However, CLIPort (Shridhar et al., 2021) addresses 4-DoF end-effector pose prediction by treating it as a pixel classification problem. Keypoint-based approaches are extended to handle 6-DoF end-effector pose prediction. Because keypoint-based methods primarily focus on 3D scene-to-action tasks, they become computationally expensive as resolution requirements increase (Ke et al., 2024). To address high spatial resolution, PerAct (Shridhar et al., 2023) uses the latent set self-attention of Perceiver (Jaegle et al., 2021), which has linear complexity in the number of voxels. Act3D (Gervet et al., 2023) represents scenes as a continuous 3D feature field, transforming 2D model features into 3D feature clouds using sensed depth, and learns a 3D feature field of arbitrary spatial resolution through recurrent coarse-to-fine point sampling.
Some research has extended the work on PerAct (Shridhar et al., 2023) and Act3D (Gervet et al., 2023). ChainedDiffuser (Xian et al., 2023) builds upon Act3D (Gervet et al., 2023) by replacing the motion planner with a diffusion model. This approach addresses the challenges of continuous interaction tasks. The 3D Diffuser Actor (Ke et al., 2024), similar to Act3D (Gervet et al., 2023), employs tokenized 3D scene representations. However, unlike Act3D (Gervet et al., 2023) and 3D Diffusion Policy (Ze et al., 2024b) with 1D point cloud embeddings, 3D Diffuser Actor (Ke et al., 2024) leverages CLIP to extract features from 2D images and aggregates them into a 3D scene representation. GNFactor (Ze et al., 2023) improves upon PerAct (Shridhar et al., 2023) by enhancing 3D semantic features. It achieves this by distilling pre-trained semantic features from 2D foundation models into Neural Radiance Fields (NeRFs). DNAct (Yan et al., 2024) builds on GNFactor (Ze et al., 2023) and transforms the perceiver model into a diffusion head. VoxAct-B (Liu et al., 2024b) uses VLM to divide the task into subtasks for the left arm and the right arm and applies PerAct (Shridhar et al., 2023) to generate separate key poses for each arm.
Current imitation learning approaches also include methods using large-scale LLMs/VLMs. LEO (Huang et al., 2023a) expands upon language foundation models by incorporating modalities like images and 3D point clouds. It is fine-tuned on manipulation datasets using the LoRA method. This showcases the ability to transfer the original foundation model to more modalities and manipulation tasks. Xu et al. (2024) consider the object motion produced by an LLM/VLM, the object’s physical properties, and the end-effector’s design, and create a ManiFoundation model to generate the key pose. However, the key pose output by the ManiFoundation model is not a 6D pose. Instead, it provides the positions of multiple contact points and the force to be applied at each contact point. 3D-VLA (Zhen et al., 2024) can generate the final state image and point cloud based on user input. This goal state is then used to create key poses in 3D-VLA.
VLADP
The policy model for outputting dense pose resembles more closely the paradigm of human task execution, as it does not require camera and spatial calibration or robot body configuration. Instead, it takes observation images as input and directly outputs the direction and magnitude of the next waypoint. While this approach is more end-to-end, it still necessitates extensive data training to embed the parameters of robot execution in the policy model’s hidden layers.
Effective robotic multi-task learning necessitates a high-capacity model, hence Gato (Reed et al., 2022) and RT-1 (Brohan et al., 2022) devise transformer-based architectures. Nonetheless, RT-1 and Gato differ; RT-1’s input lacks proprioception from the robot body, while Gato incorporates proprioception. Building upon Gato, RoboCat (Bousmalis et al., 2023) demonstrates that a large sequence model can learn unseen tasks through few-shot learning. It proposes a simple but effective self-improvement process. Additionally, it shows that predicting both the next action and the hindsight image after executing that action can enhance performance. Building upon RT-1, RoboAgent (Bharadhwaj et al., 2023) enhances model generalization and stability through data augmentation and action-chunking. MOO (Stone et al., 2023) leverages Owl-ViT to extract object locations from observation images, enhancing RT-1’s open-set detection capability.
As for multi-task reinforcement learning, PI-QT-Opt (Lee et al., 2023) leverages a large-scale, multi-task dataset and employs a model-free off-policy reinforcement learning approach for training. Q-Transformer (Chebotar et al., 2023) facilitates training high-capacity sequential architectures on mixed-quality data by applying transformer models to RL.
Fine-tuning pre-trained VLMs (Zhang et al., 2024b) to construct RFMs is considered efficient. RT-2 (Brohan et al., 2023) collects manipulation trajectory data and, after representing poses as tokens, fine-tunes VLMs such as PaLI-X (Chen et al., 2023b) and PaLM-E (Driess et al., 2023) on this data. However, democratizing such an expensive framework for all robotics practitioners proves challenging, as it relies on private models and necessitates extensive co-fine-tuning on vision-language data to fully exhibit effectiveness. Consequently, the robotics community urgently needs a low-cost alternative. RoboFlamingo (Li et al., 2023d) and OpenVLA (Kim et al., 2024) emerged to fill this need, effectively enabling robot manipulation policies built on VLMs.
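Treating poses as tokens can be illustrated with a uniform discretization of each action dimension. The sketch below is a simplified illustration; the bin count of 256, the per-dimension bounds, and the function names are assumptions chosen for the example rather than details confirmed by the text above.

```python
def action_to_tokens(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)
        fraction = (clipped - lo) / (hi - lo)
        # The upper bound maps to n_bins, so clamp it into the last bin.
        tokens.append(min(int(fraction * n_bins), n_bins - 1))
    return tokens


def tokens_to_action(tokens, low, high, n_bins=256):
    """Map bin indices back to continuous values at the bin centers."""
    return [
        lo + (token + 0.5) * (hi - lo) / n_bins
        for token, lo, hi in zip(tokens, low, high)
    ]


low, high = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
tokens = action_to_tokens([-1.0, 0.0, 1.0], low, high)
recovered = tokens_to_action(tokens, low, high)
```

The round trip loses at most half a bin width per dimension, which is the price paid for letting a language-model head emit actions with the same next-token machinery it uses for text.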
However, this approach requires a large amount of data for the hidden layers to learn parameters related to the robot body, objects, and environment. Open X-Embodiment (Padalkar et al., 2023) assembles a dataset from 22 different robots, demonstrating 527 skills (160,266 tasks). However, the current Open X-Embodiment dataset faces the challenge of dataset heterogeneity. Octo (Team et al., 2024) and HPT (Wang et al., 2024c) propose multi-module networks to address this issue, while RDT-1B (Liu et al., 2024d) and PI (Black et al., 2024) propose a unified action space. The Open X-Embodiment dataset also lacks in-the-wild scenes. DROID (Khazatsky et al., 2024) introduces an “in-the-wild” robot manipulation dataset. It contains 76k trajectories, or 350 h of interaction data, collected from 564 scenes, 86 tasks, and 52 buildings over 12 months.
Because internet videos contain information on the physics and dynamics of the world, some studies have explored training foundation models using both video datasets and manipulation data. SuSIE (Black et al., 2023) uses an image-editing diffusion model. This model is fine-tuned on human videos and robot rollouts. It acts as a high-level planner by proposing intermediate subgoals. These subgoals can be accomplished by a low-level controller. The two training steps of SuSIE do not share weights. GR-1 (Wu et al., 2023a) is initially trained on a large-scale video dataset for video prediction, and then seamlessly fine-tuned with manipulation data. GR-2 (Cheang et al., 2024) uses VQGAN to convert each image into discrete tokens and is trained with a larger text-video dataset than GR-1. LAPA (Ye et al., 2024) begins by extracting the latent delta action between video frames. It then labels the video dataset with this information. These labeled datasets are used to train a VLM network. Finally, a small-scale robot manipulation dataset is applied for fine-tuning, enabling the mapping of latent actions to robot actions. Go-1 (Bu et al., 2025) trains latent actions using human videos and combines those with manipulation data. Then, it trains the VLA on the merged dataset to boost the model’s generalization.
Previous studies, such as GR-1 (Wu et al., 2023a) and GR-2 (Cheang et al., 2024), train the policy head using MSE regression. In contrast, OpenVLA (Kim et al., 2024) and RT-2 (Brohan et al., 2023) apply next-token prediction for their policy heads. Building on the success of diffusion policy (Chi et al., 2023), PI0 (Black et al.) and TinyVLA (Wen et al., 2024b) adopt a diffusion head as their policy head, achieving better performance than OpenVLA. To address the higher degree of multi-modality in the distribution of feasible actions for bimanual manipulation, RDT-1B (Liu et al., 2024d) utilizes Diffusion Transformers (DiTs) as its scalable backbone network.
Although the diffusion policy can represent complex continuous action distributions, OpenVLA-OFT (Kim et al., 2025) has shown in dual-arm tasks that fine-tuning the VLA with an L1 regression objective achieves performance similar to diffusion-based fine-tuning while offering faster training convergence and inference. FAST (Pertsch et al., 2025) proposes a new compression-based tokenization scheme for next-token prediction. This method matches the performance of diffusion VLAs, while reducing training time by up to 5x across multiple dexterous manipulation tasks.
Previous methods face a basic tradeoff: VLM backbones are general but slow, while robot visuomotor policies are fast but not general. Synchronizing both does not improve inference speed. Helix (Figure.ai, 2025) and Groot N1 (Bjorck et al., 2025) overcome this tradeoff with two asynchronous complementary systems, trained end-to-end to communicate. However, Groot N1 makes more use of human video latent action and simulated data compared to Helix.
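The two-rate structure described above can be sketched with a loop in which a slow module refreshes a latent only every few control steps while a fast policy acts at every step using the most recent latent. This is a single-threaded illustration under stated assumptions: `slow_model`, `fast_policy`, and `slow_period` are hypothetical names, and real dual-system architectures run the two modules asynchronously rather than in one loop.

```python
def run_dual_system(observations, slow_model, fast_policy, slow_period):
    """Run a fast policy every step and a slow model every `slow_period` steps."""
    latent = None
    actions = []
    slow_calls = 0
    for step, obs in enumerate(observations):
        if step % slow_period == 0:
            # Slow, general module: refresh the high-level latent occasionally.
            latent = slow_model(obs)
            slow_calls += 1
        # Fast, reactive module: always acts on the newest observation,
        # conditioned on a latent that may be several steps old.
        actions.append(fast_policy(obs, latent))
    return actions, slow_calls


actions, slow_calls = run_dual_system(
    observations=list(range(10)),
    slow_model=lambda obs: obs * 10,
    fast_policy=lambda obs, latent: (obs, latent),
    slow_period=4,
)
```

The control rate is set by the fast policy alone, so the cost of the large model is amortized over `slow_period` steps at the expense of acting on a slightly stale latent.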
Foundation models assisting for reinforcement learning
Reinforcement learning has garnered widespread attention from researchers because it can explore the environment without requiring extensive annotated data. However, it also faces numerous challenges, such as dealing with long-horizon sequences, exploring effectively, reusing experience data, and designing reward functions (Kober et al., 2013). Foundation models have demonstrated emergent common-sense reasoning, the ability to sequence sub-goals, and visual understanding. Many studies therefore aim to leverage these capabilities to address the challenges faced by reinforcement learning. RobotGPT (Jin et al., 2024) aims to distill the knowledge of a large “brain”, ChatGPT, into a small “brain” trained with reinforcement learning. At the same time, many studies explore the use of foundation models to handle long-horizon problems, improve exploration, and design reward functions.
Norman (Di Palo et al., 2023) employs LLMs to decompose tasks into subgoals and utilizes CLIP to identify the completion of each subgoal, serving as a signal generator for sparse rewards. ROBOFUME (Yang et al., 2024a) employs a fine-tuned VLM as the sparse reward function for the RL algorithm, tackling the issue of the extensive human supervision needed for training or fine-tuning a policy in the real world. Eureka (Ma et al., 2023b) utilizes LLM to craft a reward function for five-fingered hand pen spinning. Subsequently, it engages in a cyclic process encompassing reward sampling, GPU-accelerated reward evaluation, and reward reflection to progressively refine its reward outputs. In contrast to Eureka’s self-iteration and sparse reward function design, TEXT2REWARD (Xie et al., 2023a) incorporates human feedback into the iterative updating of the reward function, yielding a dense reward function. FAC (Ye et al., 2023a) proposes using knowledge from foundation models as policy prior knowledge to improve sampling action efficiency, as value prior knowledge to measure the values of states and as success-reward prior knowledge to provide final feedback on task success.
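A sparse reward driven by subgoal completion can be sketched as a similarity check between an embedding of the current observation and an embedding of the active subgoal. In the sketch below the embeddings are plain lists of floats standing in for the outputs of a vision-language encoder; the class name `SubgoalReward`, the threshold value, and the toy vectors are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class SubgoalReward:
    """Emit reward 1.0 when the current subgoal is judged complete, else 0.0."""

    def __init__(self, subgoal_embeddings, threshold=0.9):
        self.subgoal_embeddings = subgoal_embeddings
        self.threshold = threshold
        self.index = 0  # position of the subgoal currently being pursued

    def __call__(self, observation_embedding):
        if self.index >= len(self.subgoal_embeddings):
            return 0.0  # all subgoals already completed
        target = self.subgoal_embeddings[self.index]
        if cosine_similarity(observation_embedding, target) >= self.threshold:
            self.index += 1  # advance to the next subgoal
            return 1.0
        return 0.0


reward_fn = SubgoalReward([[1.0, 0.0], [0.0, 1.0]])
frames = [[0.5, 0.5], [1.0, 0.1], [1.0, 0.1], [0.0, 2.0], [0.0, 2.0]]
rewards = [reward_fn(frame) for frame in frames]
```

Splitting a long task into ordered subgoals in this way gives the agent several intermediate reward signals instead of a single signal at the end of the episode.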
Summary
According to Figure 7, policies can be classified into VLAC, VLAKP, VLADP, and foundation models assisting for reinforcement learning. As shown in Appendix A Tables 4 and 5, comparison with baseline approaches reveals key distinctions. The strengths and limitations of VLAC, VLAKP, and VLADP are described in Table 1.
Figure 7. Foundation models for policy. (1) VLAC integrates the task instruction, pre-written APIs, and example inputs into an LLM, which generates the corresponding execution code (Liang et al., 2023; Yoshida et al., 2025). (2) VLAKP without foundation model training inputs the task instruction into an LLM, which specifies the manipulated object. The observation image is fed into a VFM for object segmentation or keypoint extraction, and both the manipulated object and the object feature images are input into a VLM, which maps the pixels of the object to be manipulated to a Cartesian 3D key pose (Huang et al., 2023b) or to relational keypoint constraints (Huang et al., 2024a). (3) VLAKP with training uses an LLM to extract tokens from task instructions. A VFM or VLM extracts features from multi-view images, and 3D features are generated through 3D Lift. Finally, these features, along with proprioception, are used as input to the diffusion model to generate the key pose (Ke et al., 2024). (4) VLADP with training (single-system) outputs the dense pose and the hindsight image by inputting the task instruction and observation into a pre-trained model after training (Black et al.). The difference between VLADP and VLAKP with training lies in generating dense poses directly through the policy, which can be converted into a trajectory as a time sequence and offers a more end-to-end approach than key poses. Key poses often require subsequent motion planning. Outputting dense poses more closely resembles the paradigm of human task execution, as it does not require camera and spatial calibration or robot body configuration. However, it still necessitates extensive data training to embed the parameters of robot execution in the policy model’s hidden layers. As for imagining the outcome of the next movement, predicting both the next action and the hindsight image can improve performance (Bousmalis et al., 2023). (5) VLADP with training (double-system) generates dense poses by using models with different inference frequencies (Figure.ai, 2025). This method can effectively leverage prior knowledge from the VLM and improve inference frequency. (6) Foundation models assisting for reinforcement learning. An LLM generates subgoals based on the task instruction to transform long-horizon tasks into short-horizon ones (Di Palo et al., 2023), facilitating RL learning. An LLM also creates a reward function for RL according to the task instruction (Ma et al., 2023b), while a VLM can utilize prior knowledge to provide predicted actions and sparse/dense rewards, enhancing effective exploration in reinforcement learning (Ye et al., 2023a).
Table 1. Strengths and limitations of VLAC, VLAKP, and VLADP.
Manipulation data generation
Demonstration data plays a crucial role in robotic manipulation, particularly in the context of imitation learning (Padalkar et al., 2023). A common approach for gathering such demonstrations is human teleoperation in the real world. However, collecting real-world data often requires significant human labor and specialized teleoperation equipment. Recently, there has been a growing number of excellent developments in low-cost teleoperation hardware, which enables the collection of high-quality demonstration data (Fang et al., 2024; Cheng et al., 2024).
To collect data in real environments, human effort is required for scene setup and data annotation (Sermanet et al., 2023). There are currently two methods for data collection: the bottom-up approach and the top-down step-by-step approach. The bottom-up approach focuses on selecting a task to perform based on the current scene. Then, it uses methods like crowd-sourcing to label the data. The top-down approach involves a step-by-step process where decision-makers assign task labels and manage overheads, such as resets and scene preparations (Sermanet et al., 2023). The robot then performs tasks according to these labels. RoboVQA (Sermanet et al., 2023) shows that the bottom-up approach is more efficient in data collection compared to the top-down step-by-step approach. DIAL (Xiao et al., 2022a) uses a fine-tuned CLIP to replace humans in labeling robot trajectories during bottom-up data collection. This transforms the robot manipulation dataset on the internet into the robot-language manipulation dataset. PAFF (Ge et al., 2023) points out that incorrect robot trajectories can be linked to new tasks and uses fine-tuned CLIP to label the incorrect robot trajectories with appropriate task labels. The above methods demonstrate that high-level cognitive models can assist in data annotation. SOAR (Zhou et al., 2024b) shows that integrating a high-level cognitive model with a low-level control policy can result in a fully autonomous data-collection system in varied real-world environments.
Generating large amounts of data in simulation is a cheaper solution. However, it still requires human effort to write both scene generation and task execution code for specific tasks (Wang et al., 2023c). Moreover, the notorious sim-to-real gap remains a challenge in transferring policies trained in simulation to real-world applications, although many methods exist to address it. Matas et al. (2018) train a policy fully in simulation through domain randomization and then successfully deploy it in the real world, even though it has never encountered real deformable objects. Therefore, simulation plays an important role in manipulation, and this section analyzes existing simulators, scene generation, demonstration generation, and the sim-to-real gap challenge.
Compared to single-frame images and language data on the internet, internet videos contain information on the physics and dynamics of the world, as well as on human behaviors and actions (Chandrasegaran et al., 2024). This information is precisely what is required for manipulation tasks. Therefore, in this section, we also introduce the internet-scale video data for robot learning.
Regardless of whether it’s in a real or simulated environment, improving the efficiency of the existing dataset is essential. The mainstream approach is dataset augmentation.
Low-cost teleoperation device
Current low-cost teleoperation can be categorized into two types: online teleoperation and offline teleoperation. The distinction is similar to the difference between SLAM and SfM. Online teleoperation is a closed-loop interaction between a demonstrator and a robot (Darvish et al., 2023). In the forward process, human motion is measured using devices that combine various sensors, such as vision, IMUs, or multi-joint encoders. The motion data from the demonstrator is then retargeted to the robot’s space. This allows the robot to accurately follow the demonstrated trajectory. During the backward process, sensor data from the robot, such as forces, torques, and tactile information, should be retargeted to the demonstrator’s space. As a result, the demonstrator can experience an immersive teleoperation environment through sensor data feedback. At the same time, the synchronization and real-time performance between the forward and backward processes are also crucial (Darvish et al., 2023). Compared to online teleoperation, offline teleoperation removes the reliance on real robots during data collection (Chi et al., 2024). Demonstrators directly perform tasks using handheld or wearable devices (Chi et al., 2024; Fang et al., 2024; Wang et al., 2024b) or use cameras to record the task execution process (Shaw et al., 2023). They do not need to supervise real robots completing the tasks, and they carry out tasks from their own direct point of view. Therefore, offline teleoperation lacks the backward feedback process. Without relying on real robots, the devices become more portable and intuitive. However, this increases the precision requirements for the retargeting algorithm.
The differences among current low-cost teleoperation devices lie primarily in two aspects. One is human motion measurement, in both online and offline teleoperation. The other is visual feedback in online teleoperation. Human motion measurement components can be broadly categorized into two classes: one captures and maps the pose of end-effectors (Chi et al., 2024; Cheng et al., 2024; Fu et al., 2024; Li et al., 2020; Liu et al., 2022a), and the other uses devices that copy joint positions (Fang et al., 2024; Wu et al., 2023c; Zhao et al., 2023a). Visual feedback can be generally classified into two types: third-person view and first-person view (Cheng et al., 2024). The third-person view observes the scene from an external position, offering a broader perspective of the surroundings. In contrast, the first-person view mimics the robot’s perspective, providing an immersive and realistic experience, such as teleoperation with a VR/AR headset.
For approaches that capture and map the pose of end-effectors, common low-cost capturing devices include the SpaceMouse (Liu et al., 2022a; Zhu et al., 2023b), cameras (Cheng et al., 2024; Fu et al., 2024; Iyer et al., 2024; Shaw et al., 2023; Li et al., 2019; Fang et al., 2020a), VR controllers (De Pace et al., 2021; Nakanishi et al., 2020), and IMU sensors (Chi et al., 2024; Fang et al., 2017a, 2017b; Li et al., 2020). The SpaceMouse-based method passes the position and orientation of the SpaceMouse as action commands for the end-effectors. This method is low-cost, easy to operate, and easy to implement, but it is not suitable for dual-arm operations. In contrast, methods based on cameras and VR are well suited for bimanual teleoperation, and VR offers the advantage of visual feedback compared to cameras. However, teleoperation methods based on cameras and VR rely heavily on the accuracy of pose estimation algorithms and are often affected by occlusion (Cheng et al., 2024; Fu et al., 2024; Iyer et al., 2024; Pavlakos et al., 2024). The main advantage of teleoperation devices based on IMU sensors lies in their wearability (Chi et al., 2024; Li et al., 2020; Wang et al., 2024b). Due to this advantage, UMI (Chi et al., 2024) and DexCap (Wang et al., 2024b) developed wearable devices capable of in-the-wild teleoperation and offline data collection.
The above systems work in Cartesian space, which requires an inverse kinematics (IK) solver, and off-the-shelf IK solvers often fail when operating near singularities of the robot. Although some bilateral teleoperation systems use haptic feedback to provide a tangible sense of the robot’s kinematic constraints, they do not address the challenges of very tight operational spaces (Silva et al., 2009). Multi-joint encoder teleoperation devices avoid the IK problem by working directly in joint space. The current design of multi-joint teleoperation devices is mainly divided into isomorphic and non-isomorphic devices (Wu et al., 2023c). Isomorphic devices refer to teleoperation systems using standard servo-based robotic arms to control manipulators with similar size and kinematics (Zhao et al., 2023a), while non-isomorphic devices use such arms to control manipulators with different size and kinematic properties. Non-isomorphic devices use kinematically equivalent structures based on DH parameters to map joint spaces between arms with different properties (Wu et al., 2023c). AirExo (Fang et al., 2024) expands this low-cost and scalable platform into a wearable device to collect cheap in-the-wild demonstrations at scale.
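The singularity problem mentioned above can be illustrated on a two-link planar arm, whose Jacobian determinant is l1 * l2 * sin(q2) and therefore vanishes when the arm is fully stretched. The sketch below compares a plain inverse-Jacobian step with a damped least-squares step, a common workaround; the arm model, link lengths, and damping value are assumptions chosen for illustration and are not drawn from any of the cited systems.

```python
import math


def jacobian(q1, q2, l1=1.0, l2=1.0):
    """Jacobian of the end-effector position of a two-link planar arm."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def determinant(J):
    return J[0][0] * J[1][1] - J[0][1] * J[1][0]


def damped_ik_step(J, v, damping):
    """Joint velocity dq = J^T (J J^T + damping^2 I)^-1 v for a 2x2 Jacobian.

    With damping = 0 and an invertible J this equals the plain inverse J^-1 v.
    """
    a = J[0][0] ** 2 + J[0][1] ** 2 + damping ** 2
    b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    d = J[1][0] ** 2 + J[1][1] ** 2 + damping ** 2
    det = a * d - b * b
    w0 = (d * v[0] - b * v[1]) / det
    w1 = (-b * v[0] + a * v[1]) / det
    return [J[0][0] * w0 + J[1][0] * w1, J[0][1] * w0 + J[1][1] * w1]


# Arm almost fully stretched: q2 is close to zero, so J is nearly singular.
J_near_singular = jacobian(0.0, 1e-3)
cartesian_velocity = [1.0, 0.0]
undamped = damped_ik_step(J_near_singular, cartesian_velocity, damping=0.0)
damped = damped_ik_step(J_near_singular, cartesian_velocity, damping=0.1)
```

Near the stretched configuration the undamped step requests joint velocities in the thousands for a unit Cartesian velocity, while the damped step stays bounded by 1 / (2 * damping) at the cost of tracking accuracy. Joint-space teleoperation sidesteps this trade-off entirely because no inversion is needed.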
As for teleoperation visual feedback, most methods (Fu et al., 2024; Li et al., 2020; Liu et al., 2022a; Wu et al., 2023c; Zhao et al., 2023a; Zhu et al., 2023b) use a third-person view, in which the operator observes the robot task directly with their own eyes. However, this observation involves some visual errors. For example, the operator may misjudge the distance between the gripper and the object being manipulated. The first-person view, provided by a VR headset (Cheng et al., 2024; Iyer et al., 2024; De Pace et al., 2021; Nakanishi et al., 2020), allows operators to perceive the robot’s surroundings immersively. However, prolonged use of a VR headset can lead to fatigue.
Representative low-cost hardware works.
Simulator
The current mainstream simulators (Zhou et al., 2023) include PyBullet (Coumans and Bai, 2016), MuJoCo (Todorov et al., 2012), CoppeliaSim (Rohmer et al., 2013), NVIDIA Omniverse, and Unity. PyBullet is easy to use and integrate, but its graphics are quite basic, so it is not suitable for applications that require complex visual effects. Therefore, PyBullet is often used together with Blender (Shi et al., 2024). MuJoCo offers a high-precision physics engine. It is suitable for simulating articulated and deformable object manipulation. However, it has a high entry barrier for beginners. CoppeliaSim offers a wide range of ready-made environments, objects, and prototyping robotic systems for users. However, when dealing with many robots or complex scenes, CoppeliaSim may encounter performance issues. NVIDIA Omniverse provides real-time physics simulation and realistic rendering. However, it requires significant computational resources. NVIDIA Omniverse also offers many interfaces that users can use to develop various applications. For example, Isaac Gym is a platform for robot reinforcement learning, developed using Omniverse. Unity offers rich visual effects and a user-friendly interface. It allows for the creation of highly interactive applications. However, its physics engine is still not precise enough. The basic components of a simulator are the physics engine and the renderer. Improvements in these components can enhance the capability of sensors in simulations, such as optical tactile sensors (Chen et al., 2023d). Learning-based simulators also show great potential. For example, Sora (Brooks et al., 2024) and UniSim (Yang et al., 2023b) use vast amounts of data from the internet to simulate the visual effects of many different actions.
Scene and demonstration generation
Simulation scenes can be created manually. However, this approach is time-consuming and labor-intensive. As a result, automated or semi-automated scene generation methods are preferred (Deitke et al., 2022). Two methods can be used. The Real-to-Sim method converts real scenes into simulation. The automated generation method generates simulation scenes without real-world observation. The Real-to-Sim method can accurately mimic the real world, but it limits the diversity of scenes. The automated generation method can create more diverse scenes and increase the variety of collected demonstrations.
The Real-to-Sim method directly refers to a digital twin. It utilizes 3D-reconstruction technology (in Section State) or inverse graphics (Chen et al., 2024c) to recreate the real-world scene in a virtual environment (Torne et al., 2024). However, a 3D-reconstructed scene is a static environment in which objects lack real-world physical properties, such as material, mass, and friction coefficients, and are non-interactive (Torne et al., 2024). Inverse graphics methods, such as URDFormer (Chen et al., 2024c), directly generate an interactive simulation environment and articulated objects from an input RGB image. Compared to 3D-reconstruction methods, this reduces human involvement and produces an interactive simulation environment. However, it lacks physical plausibility and fails to address the mismatch between the generated object’s physical properties and the real world.
As for the application of foundation models in Real-to-Sim methods, GRS (Zook et al., 2024) employs SAM2 for object segmentation from RGBD images and utilizes VLMs to describe and match objects with simulation-ready assets. This approach combines the strengths of 3D-reconstruction and inverse graphics methods. It retains the credibility of 3D-reconstruction methods while allowing objects in the scene to interact. However, it is impossible for the asset dataset to fully cover objects in the real world. Constructing an interactive asset dataset often requires manual design by the creator or human-assisted interactive object generation. ACDC (Dai et al., 2024) defines the concept of a digital cousin. Unlike a digital twin, it does not directly replicate a real-world counterpart. However, it retains similar geometric and semantic features by using similar assets when the asset dataset does not include the real-world objects. As for object pose, depth cameras are commonly used, but they struggle to capture reflective surfaces accurately. This limits their use in the wild. To address this issue, ACDC uses Depth-Anything-v2 (Yang et al., 2024b), a state-of-the-art monocular depth estimation model, to estimate the depth map.
Scene diversity primarily includes the diversity of scene layouts, such as floor plans and object placements, as well as the diversity of objects. The automated generation methods are more effective for producing large-scale diverse scenes. The automated generation methods can be categorized into rule-based and learning-based approaches. For instance, ProcTHOR (Deitke et al., 2022) introduces a procedural generation pipeline for interactive scenes using rule-based constraints and statistical priors. However, the generated scenes often rely on pre-defined priors, resulting in unrealistic outcomes that hinder agent learning (Khanna et al., 2024). To address this, PHYSCENE (Yang et al., 2024c) incorporates physical collision avoidance, object layouts, interactivity, and reachability metrics into a diffusion model. This approach enhances the physical plausibility and interactivity of generated scenes.
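A rule-based layout generator can be sketched as rejection sampling under a collision-avoidance constraint. The snippet below places circular object footprints in a rectangular room and resamples any placement that overlaps an earlier one. The circular footprints, the function name `place_objects`, and the retry limit are simplifying assumptions; this is not the pipeline of ProcTHOR or PHYSCENE.

```python
import math
import random


def place_objects(radii, room_width, room_depth, rng, max_tries=1000):
    """Sample non-overlapping (x, y, radius) placements inside the room."""
    placed = []
    for radius in radii:
        for _ in range(max_tries):
            # Keep the whole footprint inside the walls.
            x = rng.uniform(radius, room_width - radius)
            y = rng.uniform(radius, room_depth - radius)
            # Rule: footprints of different objects must not intersect.
            if all(math.hypot(x - px, y - py) >= radius + pr
                   for px, py, pr in placed):
                placed.append((x, y, radius))
                break
        else:
            raise RuntimeError("could not place object without collision")
    return placed


layout = place_objects([0.3, 0.3, 0.5], room_width=4.0, room_depth=4.0,
                       rng=random.Random(0))
```

Changing the seed yields a different valid layout each time, which is the source of scene diversity; additional rules such as reachability or wall adjacency could be added as further acceptance tests in the same loop.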
Given the prior knowledge embedded in foundation models, there are ongoing efforts to use them for scene construction. RoboGen (Wang et al., 2023e) utilizes an LLM to generate relevant assets, asset sizes, asset configurations, and scene configurations based on task proposals, and uses text-to-image-to-3D generation to create the corresponding assets. These assets are imported into the simulator to generate the appropriate scene. Finally, a VLM is used for task-specific scene verification. GenSim (Wang et al., 2023c) uses LLMs to generate new tasks and task scenario code based on the pre-cached scene code in a task library. However, when foundation models automate scene generation, checking the physical plausibility of the scene still relies on a VLM for judgment. At the same time, the above research also uses LLMs to generate diverse instructions to ensure task diversity. However, it remains challenging to ensure that diverse task instructions generated by LLMs are reasonable for the current environment.
The Real-to-Sim method and the automated generation method both rely on 3D assets. The diversity of 3D assets determines the variety of scenes (Nasiriany et al., 2024). Although there are many existing 3D object assets (Calli et al., 2017; Chang et al., 2015; Deitke et al., 2023; Li et al., 2023a; Liu et al., 2022b; Geng et al., 2023a; Xiang et al., 2020), their quantity is far from sufficient to cover the variety of real-world objects. As a result, many studies focus on the automatic generation of assets, such as Zero-1-to-3 (Wang et al., 2023e), Luma.ai (Nasiriany et al., 2024), LLaMA-Mesh (Wang et al., 2024d), and Trellis (Xiang et al., 2024). However, the performance of generative models is also limited by the shortage of current 3D training data. To address this issue, data cleaning techniques or manual supervision are needed to filter and select high-quality generated object assets.
The modeling of the interaction environment above primarily focuses on articulated object modeling. Articulated objects can be created manually by designers or generated using procedural methods (Jiang et al., 2022; Liu et al., 2023b; Zhang et al., 2023e) or human-assisted interactive methods (Torne et al., 2024) after 3D scanning. They can also be generated automatically through inverse graphics (Chen et al., 2024c) or generative models (Xiang et al., 2024). However, current automated methods for generating articulated object assets are limited to objects with few rotational joints. Real2Code (Mandi et al., 2024) fine-tunes a CodeLlama model to process visual observation descriptions and then outputs joint predictions. This enables Real2Code to reconstruct complex articulated objects with up to 10 parts. At the same time, generative models mainly focus on rigid and articulated objects, and research on deformable objects remains insufficient (Sundaresan et al., 2022).
To collect demonstrations in simulation, different approaches can be used depending on task complexity. For simple tasks, like a two-finger gripper picking up a cube, a hard-coded method (Wang et al., 2022b) can be used. However, for more complex tasks, remote teleoperation (Chen et al., 2024a) or a skill library (Ha et al., 2023) should be employed. A skill library can be built using reinforcement learning or gradient-based optimization methods. RoboGen (Wang et al., 2023e) shows that gradient-based trajectory optimization is better for fine-grained manipulation tasks with soft bodies, like shaping dough into a specific form. On the other hand, reinforcement learning and evolutionary strategies are more effective for contact-rich tasks and continuous interactions with other components in the scene.
Sim-to-real gap solutions
The sim-to-real problem is a widespread issue across machine learning, not limited to manipulation (Zhao et al., 2020). The goal is to successfully transfer the policy from the simulation (source domain) to the real world (target domain). The gap between simulation and the real world in manipulation tasks includes two main types: the visual gap and the dynamic gap. The visual gap refers to the difference between the visual information produced by the renderer and the visual information in the real world. In terms of rendering realism, BEHAVIOR-1K (Li et al., 2023a) highlights that Omniverse offers the highest rendering performance. The dynamic gap consists of several factors. First, there is a difference between the physics engine used in simulations and real-world physics. Second, the properties of objects, including robots, contribute to the object dynamic gap. Lastly, there is a control gap in robots, such as variations in static errors caused by different PID parameters. Currently, there are three main approaches to address the sim-to-real gap: system identification, domain randomization, and transfer learning (Zhao et al., 2020).
Most of the system identification research (Kristinsson and Dumont, 1992) aims to create an accurate mathematical model of a physical system to make the simulator more realistic. However, it is impossible to accurately build models of complex environments in simulators. The primary methods for physical parameter identification include estimation from interaction (Seker and Kroemer, 2024; Bohg et al., 2017; Xu et al., 2019), estimation from demonstrations (Torne et al., 2024), and estimation from observations using foundation models (Gao et al., 2023). Among these, estimation from demonstrations appears more effective. Demonstrations inherently contain interaction information and can also assist policy training. However, improving the hardware performance for collecting demonstrations remains essential.
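To make estimation from interaction concrete, the following sketch fits a friction coefficient from the observed speed of a sliding object. It assumes a simple Coulomb friction model, v(t) = v0 - mu * g * t, and uses synthetic data; the model, the function name, and the numbers are assumptions for illustration rather than the method of any cited work.

```python
from typing import Sequence

G = 9.81  # gravitational acceleration in m/s^2


def estimate_friction(times: Sequence[float], speeds: Sequence[float]) -> float:
    """Least-squares fit of v(t) = v0 - mu * G * t; returns the estimate of mu."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(speeds) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, speeds))
    var = sum((t - t_mean) ** 2 for t in times)
    slope = cov / var          # slope equals -mu * G under the assumed model
    return -slope / G


# Synthetic "interaction" data generated with mu = 0.3 (hypothetical values).
ts = [0.0, 0.1, 0.2, 0.3]
vs = [1.0 - 0.3 * G * t for t in ts]
mu_hat = estimate_friction(ts, vs)
```

The estimated parameter would then be written back into the simulator so that simulated dynamics better match the real system.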
Domain randomization (Ramos et al., 2019) involves adding random disturbances to the parameters in simulation. This can include various elements, generally divided into visual and dynamic randomization. Visual randomization covers visual parameters like lighting, object textures, and camera positions. Dynamic randomization covers dynamic parameters like object sizes, surface friction coefficients, object masses, and actuator force gains. By experiencing diverse simulated environments, the policy can adapt to a broad range of real-world conditions. For the policy, the real world is essentially just another disturbed environment. However, parameter randomization requires human expertise. Ma et al. (2024) demonstrate that LLMs excel at selecting the parameters to randomize and determining the randomization distribution. This makes domain randomization more automated.
Transfer learning (Tan et al., 2018; Yu and Wang, 2022) involves using limited real-world data to adapt a policy trained on abundant simulation data to the real world. If the policies in the real world and in simulation are treated as different tasks, task transfer methods can be used. For example, Rusu et al. (2017) use a progressive network to apply knowledge from a policy trained in simulation to a new policy trained with limited real-world data, without losing the previous knowledge. If the policies in the real world and in simulation are instead treated as the same task with different data distributions, domain adaptation methods can be used. Three common methods for domain adaptation are discrepancy-based (Lyu et al., 2024), adversarial-based (Eysenbach et al., 2020), and reconstruction-based methods (Bousmalis et al., 2016). Discrepancy-based methods measure the feature distance between the source and target domains using predefined statistical metrics. This helps to align their feature spaces. Adversarial-based methods train a feature extractor against a domain classifier that tries to determine whether features come from the source or target domain. Once trained, the extractor can produce features that are invariant across both domains. Reconstruction-based methods also aim to find shared features between domains by setting up an auxiliary reconstruction task and using the shared features to recover the original input.
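The idea can be sketched as sampling a fresh set of visual and dynamic parameters for every training episode. The parameter names, the ranges, and the uniform distribution below are hypothetical choices for illustration; real setups depend on the simulator and the task.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    light_intensity: float   # visual parameter
    friction: float          # dynamic parameter
    object_mass: float       # dynamic parameter (kg)


# Hypothetical randomization ranges; choosing them normally requires expertise.
RANGES = {
    "light_intensity": (0.5, 1.5),
    "friction": (0.2, 1.0),
    "object_mass": (0.05, 0.5),
}


def sample_params(rng: random.Random) -> SimParams:
    """Draw one randomized simulation configuration, uniform within each range."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


rng = random.Random(0)
episodes = [sample_params(rng) for _ in range(5)]  # one configuration per episode
```

Each sampled configuration would be applied to the simulator before an episode, so the policy never sees exactly the same environment twice.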
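One simple example of a predefined statistical metric for discrepancy-based adaptation is the maximum mean discrepancy with a linear kernel, which reduces to the squared distance between the mean feature vectors of the two domains. The sketch below computes it on toy features; the choice of this metric and the feature values are assumptions for illustration and are not claimed to be the metric used in the cited works.

```python
from typing import List, Sequence


def mean_vector(features: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a batch of feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def linear_mmd(source: Sequence[Sequence[float]],
               target: Sequence[Sequence[float]]) -> float:
    """Squared distance between mean features (MMD with a linear kernel)."""
    mu_s, mu_t = mean_vector(source), mean_vector(target)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


sim_feats = [[0.0, 1.0], [2.0, 3.0]]    # hypothetical simulation features
real_feats = [[1.0, 1.0], [3.0, 3.0]]   # hypothetical real-world features
gap = linear_mmd(sim_feats, real_feats)
```

In training, a term like this would typically be added to the task loss so that the feature extractor is pushed to reduce the gap between the two domains.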
The methods discussed above assume that the target domain remains unchanged. However, many physical parameters of the same robot can change significantly. Factors like temperature, humidity, positioning, and wear and tear over time can all affect these parameters. This makes it harder to bridge the sim-to-real gap. To address this issue, DORA (Zhang et al., 2024c) uses an information bottleneck principle. It aims to maximize the mutual information between the dynamics encoding and environmental data. At the same time, it minimizes the mutual information between the dynamics encoding and the behavior policy actions. Transic (Jiang et al., 2024a) proposes a data-driven approach that enables successful sim-to-real transfer using a human-in-the-loop framework.
Internet-scale video dataset
Extensive and diverse video datasets are available from online repositories. The collection process requires querying and searching for videos with relevant content. After that, low-quality video data is removed through data cleansing. However, the raw video data cannot be directly used to train the manipulation model for three reasons: (1) the absence of action or reward labels; (2) distribution shifts, including physical embodiments, camera viewpoints, and environments; and (3) the lack of essential low-level information like tactile feedback, force data, proprioceptive information, and depth perception (McCarthy et al., 2024). Although AVID (Smith et al., 2019) and LbW (Xiong et al., 2021) address the embodiment shift by translating human action images from videos into robot action images, this type of translation remains limited to the pixel level. Nevertheless, these raw videos contain extensive visual information, such as objects, spatial information, human activities, and sequences of interactions between humans and objects (Eze and Crick, 2024). At the same time, language annotations are essential to support learning of semantic features in this visual information.
Methods to obtain language annotations are divided into manual and automated captions. Manual captions are created by humans labeling video content. Automated captions include four types: (1) Automatic Speech Recognition (ASR), which converts audio in videos to text (Xue et al., 2022). (2) Alt-text, which collects captions from HTML alt-text attributes in web images and videos, like descriptions, tags, and titles (Bain et al., 2021). (3) Transfer, which starts with a set of image-caption pairs and then matches the captions to video clips with similar frames (Nagrani et al., 2022). (4) Foundation Models, which use pre-trained models to get captions. For example, VLMs provide single-frame image captions, while LLMs filter out inconsistent captions across frames (Blattmann et al., 2023). Owing to recent advancements in language annotation techniques, most widely used internet video datasets incorporate language annotations, such as InternVid (Wang et al., 2023d), HD-VILA-100M (Xue et al., 2022), YT-Temporal-180M (Zellers et al., 2021), WTS-70M (Shvetsova et al., 2025), HowTo100M (Miech et al., 2019), WebVid-10M (Nan et al., 2024), and VideoCC3M (Yan et al., 2022). At the same time, various off-the-shelf models can be used to annotate the current video with additional labels, such as pose (Shaw et al., 2023), affordance (Mendonca et al., 2023), key points trajectory (Wen et al., 2023b), latent action (Ye et al., 2024), and mask and bounding boxes (Shan et al., 2020).
The task information contained in internet video data may not be highly relevant to the specific tasks performed by robots. Additionally, internet video data often suffers from issues such as missing action labels, low-level information, and distribution shifts. Therefore, manually recording custom videos can be an effective approach to collecting videos that are directly relevant to specific robot tasks or embodiments. This method can also help avoid the issue of re-annotating. By incorporating sensors such as IMUs, tactile sensors, and depth sensors during the recording process, manually recorded custom videos can exhibit lower noise compared to internet video data. However, the scale and diversity of manually recorded videos still cannot match the internet video data (McCarthy et al., 2024). Currently, there are several commonly used manually recorded video datasets, such as Ego-4D (Grauman et al., 2022), Ego-Exo-4D (Grauman et al., 2024), RoboVQA (Sermanet et al., 2024), Epic-Kitchens-100 (Damen et al., 2022), and ActionSense (DelPreto et al., 2022).
Dataset augmentation
Current data augmentation can be mainly divided into scene-level and object-level. Scene-level augmentation refers to changing the layout of objects in the scene. For example, MimicGen (Mandlekar et al., 2023) and DexMimicGen (Jiang et al., 2024c) change the positions and orientations of objects, while CACTI (Mandi et al., 2022) adds new, artificial objects to the scene. However, the reliability of data augmentation still needs validation. For example, MimicGen (Mandlekar et al., 2023) filters data generation attempts based on task success. Current dataset augmentation methods based on foundation models primarily operate at the object level. The main idea is to use semantic segmentation to extract masks for each object, and then employ generative rendering methods to alter the object’s texture. GenAug (Chen et al., 2023c) leverages language prompts with a generative model to modify object textures and shapes, adding new distractors and background scenes. ROSIE (Yu et al., 2023) localizes the augmentation region with an open-vocabulary segmentation model and then runs an image editor to perform text-guided image editing.
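Scene-level augmentation with success filtering can be sketched as follows: perturb an object pose, check whether the task still succeeds, and keep only the successful attempts. The planar pose representation, the noise magnitudes, and the success check are hypothetical stand-ins; in a system such as MimicGen the check would come from actually executing the generated trajectory in simulation.

```python
import random
from typing import Callable, List, Tuple

Pose = Tuple[float, float, float]  # (x, y, yaw); a simplified planar pose


def augment_scene(base_pose: Pose,
                  is_success: Callable[[Pose], bool],
                  n_attempts: int,
                  rng: random.Random,
                  xy_noise: float = 0.05,
                  yaw_noise: float = 0.5) -> List[Pose]:
    """Perturb an object pose and keep only attempts that pass the success check."""
    kept = []
    for _ in range(n_attempts):
        x, y, yaw = base_pose
        candidate = (x + rng.uniform(-xy_noise, xy_noise),
                     y + rng.uniform(-xy_noise, xy_noise),
                     yaw + rng.uniform(-yaw_noise, yaw_noise))
        if is_success(candidate):   # stands in for replaying the demo in simulation
            kept.append(candidate)
    return kept


def reachable(pose: Pose) -> bool:
    """Hypothetical success check: the object must stay within reach along x."""
    return pose[0] <= 0.52


poses = augment_scene((0.5, 0.0, 0.0), reachable, 20, random.Random(0))
```

Filtering trades data volume for reliability: fewer augmented samples survive, but each surviving sample corresponds to a task that can still be completed.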
Summary
Following Figure 8, LLMs can generate credible descriptions or code for task scenes. VGMs produce 3D object meshes and render textures. Nonetheless, the validity of the generated task scenes must ultimately be assessed by VLMs. For scene generation, the Automated Generation method ensures higher diversity than Real-to-Sim. The realism of simulation depends on the simulator. Omniverse provides the best rendering performance.
Foundation models for manipulation data generation. Current mainstream simulators include PyBullet, MuJoCo, CoppeliaSim, NVIDIA Omniverse, and Unity. Meanwhile, learning-based generative models used as simulators have shown potential. Simulation environment generation can be classified into Real-to-Sim and Automated Generation methods. In the Real-to-Sim methods, assuming the object’s position is known, the main challenge lies in constructing the object’s 3D mesh. This can be achieved through scanning techniques or by using a VGM to generate the 3D mesh directly from an RGB image (Chen et al., 2024c). Additionally, GRS (Zook et al., 2024) utilizes a VLM to retrieve 3D object meshes corresponding to real-world objects from an asset database based on an RGB image. In the Automated Generation methods, an LLM can output scene descriptions or scene code based on the task instruction. When the output is a scene description, a VGM generates the objects and arranges them according to the description. Meanwhile, the generated scene needs to be evaluated by a VLM (Wang et al., 2023e). When the output is scene code, it directly generates the corresponding scene (Wang et al., 2023c). However, this requires substantial prior knowledge of scene code within the task library. There are three methods for generating demonstrations in a scene: Hard-code, Teleoperation, and Skill Library. When building a skill library, gradient optimization is effective for training skills on deformable-object tasks, and reinforcement learning works better for contact-rich tasks (Wang et al., 2023e).
Solutions for the Sim-to-Real gap include System Identification, Domain Randomization, and Transfer Learning. For data augmentation, VFM is used to segment images first, and then VGM renders the object’s texture on the masked image.
Discussion
In this survey, we aim to outline the opportunities brought by foundation models for general manipulation. We believe that embedding foundation models into manipulation tasks is a viable path toward achieving general manipulation. However, the primary applications of LLMs, VFMs, VLMs, LMMs, and VGMs focus only on certain aspects of general manipulation capability, such as reasoning, perception, multimodal understanding, and data generation. The current framework for RFMs demands extensive data for learning, which makes constructing a closed data loop a crucial issue, and ensuring a success rate above 99% remains an unresolved concern. Therefore, this paper proposes a framework of robot learning for manipulation toward achieving general manipulation capability and details how foundation models can address challenges in each module of the framework. However, many open questions remain. In this section, we delve into several open questions that we are particularly concerned about.
What is the framework for general manipulation?
Definition of general manipulation
The ultimate general manipulation framework should be able to interact with humans or other agents and control the whole body to manipulate arbitrary objects in open-world scenarios and achieve diverse manipulation tasks. However, the interaction between robots and humans involves not only recognizing intentions but also learning new skills or improving old skills from human experts in the external world. Open-world scenarios may be static or dynamic. Objects can be either rigid or deformable. Task objectives can vary from short-term to long-term. Furthermore, tasks may necessitate different degrees of precision with respect to contact points and applied forces/torques. We define Level 0 (L0) as the setting in which the robot’s learning capability is restricted to improving old skills and the robot manipulates rigid objects in static scenes to achieve short-horizon task objectives with low precision requirements for contact points and forces/torques. Current research has a high probability of achieving L0. However, safety and accuracy remain paramount concerns.
The design logic of the framework in this survey
Based on the definition of general manipulation and the development history of robot learning, this paper proposes a framework for general manipulation capability. Given that the scenarios are static, the framework is designed in a modular, sequential manner. To facilitate module migration, it is preferable for each module to be plug-and-play. Given the current reliance on human-in-the-loop mechanisms in autonomous driving and medical robotics to ensure safety, this framework aims for human–robot interaction through corrective instruction to ensure the safety of manipulation actions. The corrective actions can be collected into the dataset and then used to improve old skills through offline training.
The proposed framework limitations
(1) The framework is designed with a sequential structure, which contrasts with the parallel execution in human operation. (2) Both the proposed framework and the surveyed literature are based on learning-based approaches. While model-based methods may not generalize as well, they tend to significantly outperform learning-based methods in terms of success rates, precision, and safety for specific tasks (Pang et al., 2023). Therefore, investigating the integration of learning-based and model-based approaches remains an important research direction. (3) The framework proposed in this paper is based on the development of learning-based methods and the definition of general manipulation. Frameworks derived from brain-like cognitive research should also be explored.
Product implementation strategy
During robot execution, continuous human supervision is not always feasible. Hence, integrating real-time monitoring through parallel surveillance videos during robot execution could enhance safety. The framework in this paper does not explicitly denote this parallel safety monitoring module, as it resembles the post-conditions detection module. The post-conditions detection module analyzes the robot’s execution video to identify reasons for task failure, facilitating post-hoc correction to ensure task success. If the algorithm’s task execution safety is 80% and the monitoring module also detects unsafe behavior with 80% reliability, then, assuming the two fail independently, the probability of an undetected risky movement reduces to 4% (0.2 × 0.2). Of course, for household robots, ensuring a safety rate above 99% is imperative. Initially, cloud-based monitoring of multiple robots by a single operator, with human intervention to correct erroneous behaviors, appears to be the best approach. This strategy not only reduces labor requirements but also ensures safety. Later, once extensive data has been gathered, the model’s accuracy is expected to exceed 99%.
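The 4% figure follows from multiplying the two failure probabilities, which is only valid under the assumption that the policy and the monitor fail independently. A short sketch of this arithmetic, with a hypothetical function name:

```python
def residual_risk(policy_safety: float, monitor_reliability: float) -> float:
    """Probability that an unsafe action occurs and the monitor fails to catch it.

    Assumes that policy failures and monitor misses are independent events.
    """
    return (1.0 - policy_safety) * (1.0 - monitor_reliability)


risk = residual_risk(0.80, 0.80)   # 0.2 * 0.2, i.e. about 0.04
```

Under the same independence assumption, two components that are each 99% reliable would leave a residual risk of about 0.01%. If the monitor tends to fail in the same situations as the policy, the real risk is higher than this product.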
What kind of learning capability should a general manipulation framework possess?
The importance of learning ability
An intelligent robot for general manipulation cannot learn all the skills required in the open world during offline development. Hence, it must possess a certain learning capability (Wang et al., 2024). Within the framework of this paper, a module of corrective instruction is introduced, enabling the robot to rectify its actions. These corrective demonstrations are incorporated into the manipulation dataset and used to improve the policy offline through fine-tuning. However, this approach still focuses on improving existing task skills and cannot acquire new ones.
Definition of learning ability
The policy model should be capable of interactive, few-shot, continual, and online learning to acquire a new skill, and should reinforce its mastery of the newly learned skill through corrective instruction offline. Interactive refers to the ability to learn through human demonstration or by observing instructional videos. Learning through demonstration often requires physical control or teleoperation, which is less natural. Learning through observation of instructional videos aligns better with human learning patterns. However, when humans learn from teachers, they often do not predict the teacher’s trajectory but rather understand the high-level description of the actions, akin to VLaMP (Patel et al., 2023). Few-shot continual learning enables the robot to learn new skills from minimal demonstrations without forgetting previously learned skills. Online learning entails processing observed data instantly and enabling the model to learn as quickly as possible.
What do foundation models bring to general manipulation?
The emergence of foundation models can promote the progress of general manipulation. Meanwhile, for each section, we summarize the contributions of foundation models. As for Interaction, compared to traditional methods that use fixed questioning templates to eliminate instruction ambiguity, foundation models can provide the following for ambiguous instructions and corrective instructions: (1) more natural language communication, (2) multimodal perception to detect more types of ambiguity, and (3) powerful prior knowledge to better understand user intent. As for Object Affordance and Object Recognition in Pre- and Post-conditions Detection, foundation models bring several advantages. (1) They provide the perception capabilities for open-set affordance, detection, and segmentation, enabling zero-shot recognition of novel cases. (2) The powerful prior knowledge of foundation models accelerates the learning of tool affordance. (3) Foundation models assist methods in better understanding affordance and selecting more accurate poses.
As for the hierarchy of skills: (1) Foundation models can assist in processing and interpreting natural language inputs. (2) The acquisition of world knowledge and commonsense reasoning by foundation models enhances their perception and reasoning abilities. This has the potential to improve the scalability and generalizability of tasks within the skill hierarchy. As for 3D Reconstruction and 6D Pose Estimation in State: (1) Foundation models assist in reconstructing scenes with semantic information. (2) Foundation models’ powerful 2D feature extraction ability can be applied to 3D lifting, aiding in the extraction of 3D features. (3) Foundation models enable open-set pose estimation.
As for policy: (1) Foundation models can help the model follow instructions better. (2) Foundation models can enhance the model’s generalization ability and assist reinforcement learning. (3) Foundation models trained on large manipulation datasets can transfer prior knowledge to new tasks, such as PI0 transferring the mistake-correction ability learned from pre-training datasets to new tasks. As for manipulation data generation, the main contributions of foundation models are in simulation data and data augmentation. (1) Foundation models can generate 3D mesh assets in a zero-shot manner. (2) Foundation models help create diverse simulation scene layouts. (3) The vast priors of foundation models can be applied to data augmentation.
How to use internet-scale video data for RFMs?
As for what information from video datasets can be used, there are six main types of information to extract: (1) Pose, such as capturing human hand poses and retargeting them to dexterous hand poses (Shaw et al., 2023; Qin et al., 2022). (2) Affordance, including grasp locations on objects and post-grasp waypoints (Mendonca et al., 2023). (3) Motion information, which explicitly includes keypoint trajectories of objects and the human hand during actions (Wen et al., 2023b; Xiong et al., 2021; Yuan et al., 2024) and implicitly includes latent delta actions drawn from a codebook generated with VQ-VAE (Van Den Oord et al., 2017; Ye et al., 2024). (4) Environment transition dynamic information, such as predicting hindsight images after completing the current action (Cheang et al., 2024; Wu et al., 2023a; Yang et al., 2023b). (5) Semantic information, such as descriptions of current task steps (Wang et al., 2024a) and task instructions (Jain et al., 2024). (6) Spatial and texture information; for example, MVP (Radosavovic et al., 2023) suggests using masked autoencoding (He et al., 2022) to improve visual reconstruction.
As for how to extract this useful information, various off-the-shelf models can be used to annotate the current video with additional labels, such as pose (Shaw et al., 2023), affordance (Mendonca et al., 2023), key points trajectory (Wen et al., 2023b), latent action (Ye et al., 2024), mask, and bounding boxes (Shan et al., 2020). Once these labels have been added to the video dataset, different training objectives can be used to extract features from it, such as MAE (Radosavovic et al., 2023), contrastive learning (Ma et al., 2022), time contrastive learning (Ma et al., 2023a), temporal-difference learning (Bhateja et al., 2023), video prediction objective (Du et al., 2024), affordance prediction objective (Mendonca et al., 2023), video-language alignment objective (Nair et al., 2022), action motion objective (Yuan et al., 2024), or combinations of these objectives (Karamcheti et al., 2023; Zhou et al., 2021).
As for how to utilize the extracted information to enhance or train robotic foundation models, current robotic foundation models primarily use two learning methods: imitation learning and reinforcement learning. Therefore, the discussion of this third issue focuses on leveraging prior knowledge from video datasets in these two methods. As for imitation learning, when the robotic foundation model outputs poses and the video dataset is annotated with pose labels, the video dataset can be directly used as training data for the robotic foundation model (Kareer et al., 2024; Qin et al., 2022; Shaw et al., 2023). When leveraging affordance information, motion information, environment transition dynamics information, semantic information, or spatial and texture information, it is essential to employ GMM & CEM (Mendonca et al., 2023), an Inverse Dynamic Model (IDM) (Du et al., 2024; Ye et al., 2024; Wen et al., 2023b), or a Decoder (Cheang et al., 2024; Wang et al., 2023a; Wu et al., 2023a; Xiao et al., 2022b) to transform this information into actions. Compared to other types of information, semantic information can serve as input not only on the observation side but also on the task-instruction side (Jain et al., 2024; Jang et al., 2022; Shah et al., 2023). At the same time, semantic information can also be used to organize tasks into a hierarchy of skills (Wang et al., 2024a).
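The codebook idea behind latent actions can be illustrated by the quantization step of VQ-VAE, which maps a continuous latent vector to its nearest codebook entry. The sketch below shows only this lookup on a toy codebook; treating the latent as an encoding of the change between two frames is an assumption for illustration, and the encoder that would produce it is omitted.

```python
from typing import Sequence


def quantize(z: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry nearest to z (squared L2 distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
    return dists.index(min(dists))


# Toy codebook with three latent "actions" (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latent_delta = [0.9, 0.1]                    # would come from an encoder over two frames
action_id = quantize(latent_delta, codebook)  # nearest entry is [1.0, 0.0]
```

The resulting discrete index can then serve as an action label for videos that carry no explicit robot actions.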
As for reinforcement learning, the environment transition dynamics can be used as a transition model (Yang et al., 2023b). The encoder, trained on a video dataset with various objectives, can measure the distance between cross-embodiment actions, which then serves as the reward function or value function (Bhateja et al., 2023). For example, Guzey et al. (2024) and Xiong et al. (2021) use key points motion information to construct the encoder, which serves as the reward function for reinforcement learning. Since distribution shifts exist between cross-embodiment actions, AVID (Smith et al., 2019) and LbW (Xiong et al., 2021) translate human action images from videos into robot action images. However, this translation is limited to the pixel level.
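A reward built from embedding distance can be sketched as the negative distance between the embedding of the robot’s current observation and the embedding of a reference frame from a human video. The function name and the use of Euclidean distance are assumptions for illustration; the encoder that produces the embeddings is assumed to exist and is not shown.

```python
import math
from typing import Sequence


def embedding_reward(robot_emb: Sequence[float], human_emb: Sequence[float]) -> float:
    """Negative Euclidean distance between two embeddings; higher means closer."""
    return -math.dist(robot_emb, human_emb)


# Toy embeddings (hypothetical values standing in for encoder outputs).
far = embedding_reward([0.0, 0.0], [3.0, 4.0])    # distance 5
near = embedding_reward([2.0, 4.0], [3.0, 4.0])   # distance 1
```

A reinforcement learning agent maximizing this reward is driven toward states whose embeddings match those reached in the human video.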
Current research focuses on different types of information in video datasets. The methods for extracting and using this information vary. It is important to consider which information from video datasets should be robustly applied to robotic foundation models. Video is similar to how humans perceive the world. Humans can improve their skills by watching experts. Similarly, using video datasets to construct a reinforcement learning from human feedback (RLHF) system in robotic foundation models is worth exploring (Luo et al., 2024).
How to use foundation models for post-conditions detection and post-hoc correction?
The current data collection mostly focuses on gathering successful task execution data, ignoring the collection of data related to failed task executions. However, if data on failed task executions are collected and annotated with corresponding error reasons, it would be possible to train a model to both determine task execution success and analyze the reasons for task execution failure. AHA (Duan et al., 2024) trains a VLM to assess failures and output the reasons for these failures. However, the categories of failure modes are still limited, and it cannot output more open-ended failures, such as collaboration errors in dual-arm tasks. Many current studies use internet video data to improve the generalization of policies. Exploring the use of internet video data to enhance post-condition detection and employing multimodal perception to more accurately identify the reasons for failures is a promising direction. Post-hoc correction could then generate corrective action sequences based on the reasons for task execution failure and the task objectives, which would be handed over to a policy to generate corresponding corrective actions.
How to use foundation models for end-effector design?
Currently, there are two primary approaches to designing end-effectors. The first approach customizes the end-effector for specific tasks. The second approach makes a multi-fingered end-effector resemble a human hand. An end-effector designed with the first approach is usually easier to control because it has fewer degrees of freedom than one designed with the second approach. In Billard and Kragic (2019), dexterity is divided into two types: extrinsic dexterity and intrinsic dexterity. Extrinsic dexterity involves using external support, such as friction, gravity, and contact surfaces, to compensate for the lack of degrees of freedom. Intrinsic dexterity refers to the hand’s ability to manipulate objects using its own degrees of freedom. Because an end-effector with few degrees of freedom must rely mainly on extrinsic dexterity, the first approach still has certain limitations for general manipulation.
The first approach requires manual design, extensive testing, and continual adjustments. In Stella et al. (2023), LLMs are used for designing end-effectors. However, this area is still in its early exploration stages. Using LLMs for end-effector design generates text descriptions, which still need to be manually translated into designs. This process is not fully automated. If modules for rotational and translational joints could be developed, then training a foundation model to output a graph composed of these joints, in a manner similar to protein structure prediction networks (Jumper et al., 2021), could help reduce the challenges of manual design. As for the second approach, the human hand has many sensors and actuators. This makes it nearly impossible to design a robotic hand that closely resembles the human hand. Therefore, it is essential to design the sensors and actuators carefully.
How to use foundation models for dexterous manipulation?
One major challenge in data collection for dexterous manipulation lies in gathering data from multi-fingered end-effectors. Although model-based hard-coded methods (Zhu et al., 2024) can collect data on dexterous manipulation, they still require data analysis such as mutual information (Hejna et al., 2025) and entropy (Zhu et al., 2024) to assess the quality of the data. Additionally, for multi-scenario and multi-task data collection, teleoperation methods are less dependent on algorithm performance than model-based hard-coded methods. However, online teleoperation requires a real-robot system, which is not portable and cannot achieve in-the-wild data collection. Therefore, current mainstream research focuses on directly tracking human hand motions during manipulation without controlling a real robot (Wang et al., 2024b).
Two main learning-based methods for dexterous manipulation are imitation learning (Ze et al., 2024a) and reinforcement learning (Ma et al., 2023b). Imitation learning can use a visual encoder (in Section Pre- and post-conditions detection) for visuo-motor control. Diffusion policy (Chi et al., 2023) adapts the concept of diffusion to visuo-motor control. It addresses challenges in visuo-motor control such as action multimodality and sequential correlation, and it accommodates high-dimensional action sequences. Imitation learning can also fine-tune existing RFMs (in Section Policy). Fine-tuning RFMs allows a skill to work in an open world. This often performs better on unseen objects compared to visuo-motor control (Brohan et al., 2023).
Reinforcement learning offers exploration capability, which helps address the problem of suboptimal solutions. This advantage distinguishes it from imitation learning. However, reinforcement learning is primarily trained in simulation. It still has limitations in addressing the sim-to-real challenge of complex tasks, such as pen-spinning. In Section Policy, the use of foundation models to assist reinforcement learning is introduced. FAC (Ye et al., 2023a) offers potential for training reinforcement learning in real-world environments, but it still lacks consideration of environment resets (Gupta et al., 2021) and safety. Therefore, using foundation models to assist reinforcement learning in real-world training requires further exploration.
Current learning methods each have their strengths and weaknesses (Zhang et al., 2024a). Therefore, learning approaches for dexterous manipulation should integrate different methods. For example, diffusion policy can help reinforcement learning handle high-dimensional action spaces, while reinforcement learning can help diffusion policy overcome issues with suboptimal and negative data. Additionally, the learning models should consider both inputs and outputs. The factors necessary for achieving dexterous manipulation are summarized in Appendix D.
How to use foundation models for whole-body control?
The above discussion primarily focuses on the contact between the end-effector and the object. However, whole-body control is still needed in dexterous manipulation. For example, in a polishing robot, force-position hybrid control of the robotic arm is often required to manage the trajectory of contact points and forces/torques. Mobile manipulation is also essential for the reachability of dexterous manipulation. This idea is inspired by how humans handle objects. For example, when playing badminton, people use their waists, shoulders, elbows, and wrists together to hit the shuttlecock further. This aspect is often overlooked by current foundation models for manipulation. Although LEO (Huang et al., 2023a) can provide poses for both navigation and manipulation, it still does not address the synchronization issue between the two.
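The force-position hybrid control mentioned for the polishing example can be illustrated with a minimal sketch. It assumes a task frame in which each Cartesian axis is either position-controlled or force-controlled, selected by a diagonal matrix. The function name `hybrid_command`, the gains `kp` and `kf`, and the sign convention are illustrative assumptions, not taken from any surveyed work.

```python
import numpy as np

def hybrid_command(x, x_des, f, f_des, force_axes, kp=2.0, kf=0.01):
    """Cartesian velocity command for hybrid force-position control (sketch).

    force_axes[i] is True when axis i is force-controlled (e.g. the surface
    normal while polishing); the remaining axes track position.
    kp and kf are hypothetical proportional gains.
    """
    x, x_des = np.asarray(x, float), np.asarray(x_des, float)
    f, f_des = np.asarray(f, float), np.asarray(f_des, float)
    # Selection matrix: 1 on position-controlled axes, 0 on force-controlled ones.
    S = np.diag([0.0 if is_force else 1.0 for is_force in force_axes])
    I = np.eye(len(force_axes))
    v_position = kp * (x_des - x)   # track the contact-point trajectory
    v_force = kf * (f_des - f)      # regulate the contact force
    return S @ v_position + (I - S) @ v_force

# Move along x while regulating the normal force along z.
cmd = hybrid_command(
    x=[0.0, 0.0, 0.1], x_des=[0.1, 0.0, 0.1],
    f=[0.0, 0.0, -5.0], f_des=[0.0, 0.0, -10.0],
    force_axes=[False, False, True],
)
```

In this sketch the tangential axes follow the desired trajectory while the normal axis moves only in response to the force error, which is the separation the polishing example relies on.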
For whole-body control, the focus is on low-level control issues. A straightforward idea is to expand the action space of the policy model to include all joints of the robot. However, as the output dimensions increase, end-to-end training methods are more likely to diverge. Therefore, most current models output Cartesian-space poses and forces/torques. These outputs are then optimized and converted into a position or torque for each joint through a post-processing module (Haviland and Corke, 2021). To address end-to-end whole-body control issues, fundamental research is needed to facilitate network training and deployment.
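As a rough illustration of such a post-processing step, the sketch below converts a Cartesian target into joint positions for a planar serial arm using damped least-squares inverse kinematics. This is a deliberately simple stand-in and not the optimization-based method of Haviland and Corke (2021). The arm model, the function names, and the damping and step values are all assumptions made for the example.

```python
import numpy as np

def forward_kinematics(q, link_lengths):
    """End-effector (x, y) of a planar serial arm with revolute joints."""
    angles = np.cumsum(q)
    return np.array([np.sum(link_lengths * np.cos(angles)),
                     np.sum(link_lengths * np.sin(angles))])

def jacobian(q, link_lengths):
    """2 x n position Jacobian of the planar arm."""
    angles = np.cumsum(q)
    n = len(q)
    J = np.zeros((2, n))
    for i in range(n):
        # Joint i rotates every link from i onward.
        J[0, i] = -np.sum(link_lengths[i:] * np.sin(angles[i:]))
        J[1, i] = np.sum(link_lengths[i:] * np.cos(angles[i:]))
    return J

def solve_ik(target, q0, link_lengths, damping=0.05, step=0.5,
             iters=200, tol=1e-4):
    """Iterate dq = J^T (J J^T + damping^2 I)^-1 e until the error is small."""
    q = np.array(q0, dtype=float)
    target = np.asarray(target, dtype=float)
    for _ in range(iters):
        error = target - forward_kinematics(q, link_lengths)
        if np.linalg.norm(error) < tol:
            break
        J = jacobian(q, link_lengths)
        dq = J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), error)
        q += step * dq
    return q

# Hypothetical 3-joint arm: the policy outputs a Cartesian target,
# and the post-processing step returns joint angles.
links = np.array([1.0, 1.0, 0.5])
target = np.array([1.2, 0.8])
q_solution = solve_ik(target, q0=[0.3, 0.3, 0.3], link_lengths=links)
reached = forward_kinematics(q_solution, links)
```

The damping term keeps the update bounded near singular configurations, which is one reason a separate post-processing module is easier to make robust than a policy that emits every joint command directly.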
How to establish a benchmark?
Current research on foundation models for manipulation focuses on various tasks, including interaction, hierarchical tasks, perception, detecting pre- and post-conditions, policy, and manipulation data generation. Therefore, a benchmark for foundation models for manipulation should include a comprehensive framework with diverse tasks. This framework should test individual tasks and tasks that involve connecting different modules. Since different simulators have unique physics engines and renderers, the benchmark should include standardized simulators and datasets.
Representative benchmarks.
As for the issue that RFMs are not transferable between different robots: this arises from testing RFM algorithms alone without considering hardware, which is an ineffective approach. General manipulation requires whole-body control. Thus, evaluating the generalization and success rate of RFMs should involve both algorithms and hardware, unlike in computer vision where only algorithms are considered. To address this, the simulation benchmark should include an easy interface for importing various robot hardware configurations.
As for the requirement of a wide range of scenes and tasks: Although iGibson (Li et al., 2021) and BEHAVIOR-1K (Li et al., 2023a) support simulating a variety of household tasks with high realism, they are still manually constructed. In Section Manipulation data generation, we discuss how foundation models can automate the generation of scenes and tasks. Using foundation models to create numerous scenes, combined with VLMs for accuracy checking and minimal human intervention, could be a valuable approach to explore.
As for the metrics for assessing general manipulation: the current evaluation standards mainly focus on success rates. However, in real-world applications, other metrics should also be considered. For instance, the system’s real-time performance is important. Most algorithms focus on building the generalization of skills, and they often overlook the amount of data and the time required for RFMs to learn a new skill. Therefore, evaluation metrics should also include the learning ability of RFMs.
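A minimal sketch of how these three kinds of metric could be computed from benchmark logs is given below. The function names, the latency budget, the success threshold, and the sample numbers are hypothetical and serve only to make the idea concrete.

```python
def success_rate(outcomes):
    """Fraction of successful trials in a list of booleans."""
    return sum(outcomes) / len(outcomes)

def realtime_ratio(latencies_ms, budget_ms):
    """Fraction of control steps that finish within the latency budget."""
    return sum(latency <= budget_ms for latency in latencies_ms) / len(latencies_ms)

def demos_to_threshold(results_by_demos, threshold=0.8):
    """Smallest number of demonstrations whose success rate reaches the threshold.

    results_by_demos maps a demonstration count to the trial outcomes
    obtained after learning from that many demonstrations.
    Returns None if the threshold is never reached.
    """
    for n_demos in sorted(results_by_demos):
        if success_rate(results_by_demos[n_demos]) >= threshold:
            return n_demos
    return None

# Illustrative logs for one new skill (made-up values).
results = {
    10: [True, False, False, False],
    50: [True, True, False, True],
    100: [True, True, True, True, False],
}
report = {
    "final_success_rate": success_rate(results[100]),
    "realtime_ratio": realtime_ratio([20.0, 80.0, 120.0, 95.0], budget_ms=100.0),
    "demos_to_80_percent": demos_to_threshold(results, threshold=0.8),
}
```

Reporting the demonstration count alongside the success rate makes the data efficiency of learning a new skill visible, which a success rate alone does not capture.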
Overall, to assess the ability for general manipulation, methods used for testing medical robots can be referenced. Start with extensive testing in simulation environments, followed by limited tests in real-world settings. Continue evaluating the general manipulation capability during the product’s application phase.
Conclusion
The impressive performance of foundation models in the fields of computer vision and natural language suggests the potential of embedding foundation models into manipulation tasks as a viable path toward achieving general manipulation capability. However, current research lacks consideration of a general manipulation framework. Thus, this paper proposes a general manipulation framework based on the development of robot learning for manipulation and the definition of general manipulation. It also describes the opportunities that foundation models bring to each module of the framework.
We define Level 0 (L0) as restricting the robot’s learning capability to improving old skills and to manipulating rigid objects in static scenes, in order to achieve short-horizon task objectives with low precision requirements for contact points and forces/torques. Current research has a high probability of achieving L0.
Then, we discuss the following points: (1) the logic and implementation strategies of the designed framework, (2) the learning capability required for general manipulation, (3) what foundation models bring for general manipulation, (4) how to use internet-scale video data for RFMs, (5) how to use foundation models for post-conditions detection and post-hoc correction, (6) how to use foundation models for end-effector design, (7) how to use foundation models for dexterous manipulation, (8) how to use foundation models for whole-body control, and (9) how to establish a benchmark.
Additionally, the proposed framework has certain limitations: (1) The framework is designed with a sequential structure, which contrasts with the parallel execution in human operation. (2) Both the proposed framework and the surveyed literature are based on learning-based approaches. While model-based methods may not generalize as well, they tend to significantly outperform learning-based methods in terms of success rates, precision, and safety for specific tasks (Pang et al., 2023). Therefore, investigating the integration of learning-based and model-based approaches remains an important research direction. (3) The framework proposed in this paper is based on the development of learning-based methods and the definition of general manipulation. The framework of brain-like cognitive research should also be explored.
Finally, foundation models present opportunities for each module of the framework, but many challenges still remain: (1) (2) (3) (4) (5) (6) (7)
Acknowledgments
The authors thank the editors and anonymous reviewers for the constructive feedback and in-depth review of this work. This paper pays tribute to the researchers and engineers who tirelessly advance the field of robotics — let’s change the world.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the National Natural Science Foundation of China under Grant 62536001 and 62173197, and was also supported by the National Natural Science Fund for Key International Collaboration 62120106005.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
