Researchers Turn Visual Data Into Language to Help Robots Navigate
The LangNav approach from MIT and IBM researchers converts visual data into text descriptions, improving robots' ability to understand their surroundings and follow navigation instructions
Researchers have developed a novel way to help robots better navigate environments using only natural language instructions and descriptions instead of complex visual processing.
Researchers from MIT CSAIL, the MIT-IBM Watson AI Lab and Dartmouth College created the LangNav method, which converts a robot's visual observations into text captions that are then used to guide it through an environment.
In a recently published paper, the researchers reported that their language-based approach outperformed traditional vision-based navigation methods and transferred more readily to new tasks.
“We show that we can learn to navigate in real-world environments by using language as a perceptual representation,” the paper reads. “Language naturally abstracts away low-level perceptual details, which we find to be beneficial for efficient data generation and sim-to-real transfer.”
Training a robot to perform a task such as picking up an object typically requires large amounts of visual data to provide it with instructions.
The researchers propose that language could serve as a viable alternative to raw visual information, generating trajectories that guide a robot to its goal.
Instead of directly using raw visual observations, the researchers converted the visual inputs into text descriptions using off-the-shelf computer vision models for image captioning (BLIP) and object detection (Deformable DETR).
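To make the vision-to-text step concrete, here is a minimal sketch using the publicly available Hugging Face implementations of BLIP captioning and Deformable DETR detection. The checkpoints, confidence threshold, and description format below are illustrative assumptions, not the exact setup used in LangNav.

```python
# Sketch: turning a single camera image into a text description using
# off-the-shelf captioning (BLIP) and object detection (Deformable DETR).
# Model checkpoints and thresholds are illustrative, not LangNav's own.
from PIL import Image
import torch
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    AutoImageProcessor, DeformableDetrForObjectDetection,
)

caption_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
caption_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

det_processor = AutoImageProcessor.from_pretrained("SenseTime/deformable-detr")
det_model = DeformableDetrForObjectDetection.from_pretrained("SenseTime/deformable-detr")


def describe_view(image: Image.Image) -> str:
    """Return a one-line natural-language description of a camera view."""
    # 1. Caption the overall scene.
    inputs = caption_processor(images=image, return_tensors="pt")
    out = caption_model.generate(**inputs, max_new_tokens=30)
    caption = caption_processor.decode(out[0], skip_special_tokens=True)

    # 2. List detected objects above a confidence threshold.
    det_inputs = det_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        det_outputs = det_model(**det_inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = det_processor.post_process_object_detection(
        det_outputs, threshold=0.5, target_sizes=target_sizes
    )[0]
    objects = {det_model.config.id2label[l.item()] for l in results["labels"]}

    return f"{caption}. Visible objects: {', '.join(sorted(objects)) or 'none'}."


# Example usage:
# view = Image.open("front_camera.jpg")
# print(describe_view(view))
```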
These text descriptions of the visual scene were then fed into a large pre-trained language model that had been fine-tuned for navigation tasks.
The resulting method generated text-based instructions for a robot, offering detailed guidance on how to navigate a specific path. For example: “Go down the stairs and straight into the living room. In the living room walk out onto the patio. On the patio stop outside the doorway.”
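The article does not show the exact prompt template LangNav uses, but a hedged sketch of how the instruction and per-direction scene descriptions might be assembled into a prompt for a language model looks like this. The template, the direction labels, and the placeholder model checkpoint are assumptions for illustration only.

```python
# Sketch: combining a navigation instruction with per-direction scene
# descriptions into a single text prompt for a language model.
# The template and the model checkpoint are illustrative assumptions;
# LangNav fine-tunes a large pre-trained LM for navigation.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model


def build_prompt(instruction: str, view_descriptions: dict[str, str]) -> str:
    """Format the instruction and scene descriptions as one text prompt."""
    lines = [f"Instruction: {instruction}", "Current observations:"]
    for heading, description in view_descriptions.items():
        lines.append(f"- To the {heading}: {description}")
    lines.append("Next action:")
    return "\n".join(lines)


prompt = build_prompt(
    instruction="Go down the stairs and straight into the living room.",
    view_descriptions={
        "front": "a staircase leading down to a hallway. Visible objects: stairs, railing.",
        "left": "a closed wooden door. Visible objects: door.",
        "right": "a window overlooking a garden. Visible objects: window, plant.",
    },
)
print(generator(prompt, max_new_tokens=20)[0]["generated_text"])
```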
By representing the visual scene through language, the method helps a robot better understand the path it needs to take while reducing the amount of information its hardware must process.
The paper suggests the LangNav approach outperformed traditional robotic navigation methods that rely solely on visual information.
The language-based approach even worked well in low-data settings, where only a few expert navigation examples were available for training.
“Our approach is found to improve upon baselines that rely on visual features in settings where only a few expert trajectories are available, demonstrating the potential of language as a perceptual representation for navigation,” the paper reads.
While the researchers described the approach as promising, they noted that LangNav has limitations. Some visual information is lost when a scene is converted into text, which can harm the model's ability to understand the scene as a whole.