Google DeepMind, Google's AI research division, has developed a new robotics model called RT-2 that combines vision, language, and action to let machines understand and execute instructions in real time. The model is trained on a combination of images, text, and coordinate data describing a robot's movement through space. Given a command, it can then generate a plan of action along with the coordinates needed to carry it out. This breakthrough could change how humans interact with robots and pave the way for more intuitive, efficient communication between humans and machines.
The RT-2 model, short for "Robotics Transformer 2," builds upon previous Google efforts, including PaLI-X and PaLM-E, vision-language models that combine text and image data. While those models focused on tasks such as captioning images or answering questions about them, RT-2 goes a step further by generating not only a plan of action but also the coordinates of movement in space. This integration of language, vision, and action is a significant milestone in robotics: it eliminates the need for low-level programming and allows a more seamless, natural interaction between humans and machines.
One of the key insights of the RT-2 model is treating robot actions as just another language. By representing robot actions (changes along the robot's degrees of freedom, such as the position and rotation of its gripper) as sequences of discrete tokens, the model can incorporate them into its training alongside language tokens and image tokens. The robot's actions thus become just another part of the overall instruction, allowing the model to generate meaningful actions from the input it receives. Treating coordinates as a language for robot actions opens up new possibilities for how robots can be controlled and instructed.
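The idea of turning continuous motion into "tokens" can be sketched in a few lines. This is a minimal illustration, not DeepMind's exact scheme: the bin count (256) and the 8-dimensional action layout (end-effector position, rotation, gripper state, termination flag) are assumptions made for the example.

```python
# Sketch: discretizing continuous robot actions into integer "action tokens"
# that can share a vocabulary with text tokens.
# NOTE: bin count and action layout are illustrative assumptions.

NUM_BINS = 256          # resolution of the discretization
LOW, HIGH = -1.0, 1.0   # assumed normalized range of each action dimension

def action_to_tokens(action):
    """Map each continuous action dimension to an integer bin (a 'token')."""
    tokens = []
    for value in action:
        clipped = max(LOW, min(HIGH, value))
        tokens.append(int((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1)))
    return tokens

def tokens_to_action(tokens):
    """Invert the mapping: recover approximate continuous values."""
    return [LOW + t / (NUM_BINS - 1) * (HIGH - LOW) for t in tokens]

# Example: a small end-effector displacement plus an "open gripper" command.
action = [0.10, -0.25, 0.05, 0.0, 0.0, 0.0, 1.0, 0.0]
tokens = action_to_tokens(action)
recovered = tokens_to_action(tokens)
```

Once actions are integers in a fixed range, they can be appended to a language model's vocabulary and predicted exactly like words, which is what makes the "actions as another language" framing work.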
To train the RT-2 model, DeepMind uses a combination of image and text data, as well as actions extracted from recorded robot data. The model is trained to generate actions from a given prompt, such as picking up the object that is different from the others. The model's output includes both a plan of action and a series of coordinates in space for carrying it out. This capability allows the model to generalize to a variety of real-world situations and perform tasks that require reasoning, symbol understanding, and human recognition.
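Because the model emits a single text sequence, downstream code has to split that sequence back into a human-readable plan and a machine-executable action. The sketch below assumes a simple "Plan: ... Action: ..." string format invented for illustration; RT-2's actual output encoding differs.

```python
# Sketch: parsing a generated response into a plan and an action-token list.
# NOTE: the "Plan: ... Action: ..." format is a hypothetical encoding
# chosen for this example, not RT-2's real output format.

def parse_response(text):
    """Split a generated string into a natural-language plan
    and a list of integer action tokens."""
    plan_part, action_part = text.split("Action:")
    plan = plan_part.replace("Plan:", "").strip()
    tokens = [int(tok) for tok in action_part.split()]
    return plan, tokens

response = "Plan: pick up the odd object out Action: 140 95 132 127 127 127 255 0"
plan, tokens = parse_response(response)
# plan   -> "pick up the odd object out"
# tokens -> [140, 95, 132, 127, 127, 127, 255, 0]
```

The integer tokens would then be de-discretized back into joint or end-effector commands for the robot controller.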
In tests against previous versions of the model and other programs, RT-2 built on either PaLI-X or PaLM-E showed significant improvements in proficiency. It can interpret relations between objects, determine which object to pick up and where to place it, and even repurpose pick-and-place skills learned from robot data to place objects near semantically indicated locations. These emergent capabilities demonstrate the model's potential to handle complex tasks and adapt to new situations.
The development of the RT-2 model represents a major step forward in the field of robotics and human-machine interaction. By combining vision, language, and action, this model enables machines to understand and execute instructions in a more intuitive and efficient manner. It has the potential to revolutionize various industries, including manufacturing, healthcare, and logistics, where robots play a crucial role. As researchers continue to push the boundaries of AI and robotics, we can expect further advancements that will shape the future of technology and automation.