Imagine you’re visiting a friend abroad and you look inside their fridge to see what would make for a great breakfast. Many objects initially seem foreign to you, each encased in unfamiliar packaging and containers. Despite these visual distinctions, you start to understand what each is used for and pick them up as needed.
Inspired by humans’ ability to handle unfamiliar objects, a group at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) designed Feature Fields for Robotic Manipulation (F3RM), a system that blends 2D images with foundation model features into 3D scenes to help robots recognize and understand nearby objects. F3RM can interpret open-ended language prompts from humans, making the method helpful in real-world environments that contain thousands of objects, such as warehouses and homes.
F3RM gives robots the ability to interpret open-ended text prompts phrased in natural language, helping the machines manipulate objects. As a result, the machines can understand less-specific requests from humans and still complete the desired task. For example, if a user asks the robot to “pick up a tall mug,” the robot can locate and grab the object that best fits that description.
“Creating robots that can actually generalize in the real world is incredibly hard,” says Ge Yang, a postdoc at the National Science Foundation AI Institute for Artificial Intelligence and Fundamental Interactions and MIT CSAIL. “We really want to figure out how to do that, so with this project, we try to push for an aggressive level of generalization, from just three or four objects to anything found at MIT’s Stata Center. We wanted to learn how to make robots as flexible as ourselves, since we can grasp and place objects even though we’ve never seen them before.”
Learning “what’s where by looking”
This method could assist robots with picking items in large fulfillment centers, with their unavoidable chaos and unpredictability. In these warehouses, robots are often given a description of the inventory they need to identify. The robot must match the text to an object, regardless of variations in packaging, so that customer orders are shipped correctly.
For example, the fulfillment centers of major online retailers may contain millions of items, many of which a robot may never have encountered before. To work at such a scale, robots need to understand the geometry and semantics of various objects, some of which are in tight spaces. With F3RM’s advanced spatial and semantic perception capabilities, a robot can be more effective at locating an item, placing it in a bin, and then sending it on for packaging. Ultimately, this will help factory workers ship customer orders more efficiently.
“One thing that often surprises people about F3RM is that the same system also works at room and building scale, and can be used to build simulation environments for robot learning and large maps,” Yang says. “But before we scale up this work further, we want to make this system work really fast. That way, we can use this type of representation for more dynamic robotic control tasks, hopefully in close to real time, so that robots handling more dynamic tasks can use it for perception.”
The MIT team says F3RM’s ability to understand different scenes could make it useful in urban and indoor environments. For example, the approach could help personal robots recognize and pick up specific objects. The system helps robots understand their surroundings, both physically and perceptually.
“Visual perception was defined by David Marr as the problem of knowing ‘what is where by looking,’” says senior author Phillip Isola, MIT associate professor of electrical engineering and computer science and CSAIL principal investigator. “Recent foundation models have gotten really good at knowing what they are looking at; they can recognize thousands of object categories and provide detailed text descriptions of images. At the same time, radiance fields have gotten really good at representing where things are in a scene. The combination of these two approaches can create a representation of what is where in 3D, and our work shows that this combination is especially useful for robotic tasks, which require manipulating objects in 3D.”
Creating a “Digital Twin”
F3RM begins to understand its surroundings by taking pictures with a camera mounted on a selfie stick. The camera captures 50 photographs in different poses, enabling the system to build a neural radiance field (NeRF), a deep learning method that takes 2D images and constructs a 3D scene. This collage of RGB photos creates a “digital twin” of the surroundings, a 360-degree representation of what’s nearby.
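At the core of a NeRF-style renderer like the one described above is alpha compositing: density and color samples along each camera ray are blended into a single pixel color. The sketch below is a minimal, illustrative version of that step in NumPy; the function names and the toy data are assumptions for illustration, not F3RM’s actual implementation.

```python
import numpy as np

def render_ray(densities, colors, deltas):
    """Alpha-composite samples along one camera ray (NeRF-style).

    densities: (N,) non-negative volume densities at the sample points
    colors:    (N, 3) RGB colors at the sample points
    deltas:    (N,) distances between consecutive samples
    Returns the rendered RGB color for the ray.
    """
    alphas = 1.0 - np.exp(-densities * deltas)            # opacity of each sample
    transmittance = np.cumprod(
        np.concatenate([[1.0], 1.0 - alphas[:-1]]))       # light surviving up to each sample
    weights = transmittance * alphas                      # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)

# Toy example: a dense red "surface" midway along the ray dominates the pixel.
densities = np.array([0.0, 0.0, 50.0, 0.0])
colors = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
deltas = np.full(4, 0.25)
print(render_ray(densities, colors, deltas))  # ≈ [1, 0, 0] (red)
```

Training a NeRF amounts to adjusting the underlying density and color predictions so that rays rendered this way match the 50 captured photographs.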
In addition to the highly detailed neural radiance field, F3RM also builds a feature field to augment the geometry with semantic information. The system uses CLIP, a vision foundation model trained on millions of images to efficiently learn visual concepts. By reconstructing the 2D CLIP features for the images captured by the selfie stick, F3RM effectively lifts the 2D features into a 3D representation.
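One common way to lift 2D features into 3D, and a reasonable reading of the paragraph above, is feature distillation: the same per-sample weights used to composite color along a ray can composite feature vectors, and the rendered feature is supervised to match the 2D CLIP feature at the corresponding pixel. The sketch below illustrates that idea with toy data; the function names and dimensions are assumptions, not F3RM’s API.

```python
import numpy as np

def render_feature(weights, sample_features):
    """Composite per-sample feature vectors along a ray.

    weights:         (N,) compositing weights (as in volumetric rendering)
    sample_features: (N, D) feature vectors predicted at the sample points
    Returns the (D,) rendered feature for the ray's pixel.
    """
    return (weights[:, None] * sample_features).sum(axis=0)

def distillation_loss(rendered_feature, clip_feature_2d):
    """Mean-squared error between a rendered feature and its 2D CLIP target."""
    return float(np.mean((rendered_feature - clip_feature_2d) ** 2))

# Toy example: a 4-sample ray with 8-dimensional features.
rng = np.random.default_rng(0)
weights = np.array([0.0, 0.1, 0.8, 0.1])
sample_features = rng.normal(size=(4, 8))
target = render_feature(weights, sample_features)  # a "perfect" 2D target
print(distillation_loss(render_feature(weights, sample_features), target))  # → 0.0
```

Minimizing a loss like this over many rays and images would propagate the 2D semantics into a consistent 3D feature field.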
Keeping things open
After a few demonstrations, the robot applies what it knows about geometry and semantics to grasp objects it has never encountered before. Once a user submits a text query, the robot searches through the space of possible grasps to identify the one most likely to succeed in picking up the requested object. Each candidate is scored based on its relevance to the prompt, its similarity to the demonstrations the robot has been trained on, and whether it causes any collisions. The highest-scoring grasp is then selected and executed.
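The ranking described above can be sketched in a few lines: each candidate grasp gets a score from (a) the similarity of its local feature to the text embedding of the query, (b) its similarity to a demonstrated grasp, and (c) a collision penalty. The scoring weights, flat penalty, and all names here are illustrative assumptions, not F3RM’s actual formulation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rank_grasps(grasp_features, text_embedding, demo_feature, collides):
    """Return the index of the best-scoring candidate grasp."""
    scores = []
    for feat, hit in zip(grasp_features, collides):
        score = cosine(feat, text_embedding)   # relevance to the language prompt
        score += cosine(feat, demo_feature)    # similarity to demonstrated grasps
        if hit:
            score -= 10.0                      # heavily penalize collisions
        scores.append(score)
    return int(np.argmax(scores))

# Toy example: candidate 1 closely matches the query and is collision-free.
rng = np.random.default_rng(1)
text = rng.normal(size=16)                     # stand-in for a CLIP text embedding
demo = text + 0.1 * rng.normal(size=16)        # demonstrations resemble the target
candidates = [rng.normal(size=16),
              text + 0.05 * rng.normal(size=16),
              rng.normal(size=16)]
print(rank_grasps(candidates, text, demo, collides=[False, False, True]))  # → 1
```

If candidate 1 were marked as colliding instead, the penalty would push the choice to a collision-free alternative, which mirrors the trade-off the system makes between semantic relevance and physical feasibility.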
To demonstrate the system’s ability to interpret open-ended requests from humans, the researchers prompted the robot to pick up Baymax, a character from Disney’s “Big Hero 6.” While F3RM had never been directly trained to pick up a toy of the cartoon superhero, the robot used its spatial awareness and the vision-language features from the foundation models to decide which object to grasp and how to pick it up.
F3RM also enables users to specify which object they want the robot to handle at different levels of linguistic detail. For example, if there is a metal mug and a glass mug, the user can ask the robot for the “glass mug.” If the robot sees two glass mugs, one filled with coffee and the other with juice, the user can ask for the “glass mug with coffee.” The foundation model features embedded within the feature field enable this level of open-ended understanding.
“If I show someone how to lift a mug by the lip, they can easily transfer that knowledge to lifting objects with similar geometries, such as bowls, measuring beakers, or even rolls of tape. For robots, achieving this level of adaptability has been quite challenging,” says MIT PhD student, CSAIL affiliate, and co-lead author William Shen. “F3RM combines geometric understanding with semantics from foundation models trained on internet-scale data to enable this level of aggressive generalization from just a small number of demonstrations.”
Shen and Yang wrote the paper under Isola’s supervision, with co-authors MIT professor and CSAIL principal investigator Leslie Pack Kaelbling and students Alan Yu and Jansen Wong. The team was supported in part by Amazon.com Services, the National Science Foundation, the Air Force Office of Scientific Research, the Office of Naval Research’s Multidisciplinary University Research Initiative, the Army Research Office, the MIT-IBM Watson AI Lab, the National Institutes of Health, and the MIT Quest for Intelligence. Their work will be presented at the 2023 Conference on Robot Learning.