Researchers Turn Visual Data Into Language to Help Robots Navigate

MIT and IBM's LangNav approach converts visual data into text descriptions, improving robots' ability to understand and follow navigation instructions

Ben Wodecki, Jr. Editor

June 19, 2024


Researchers have developed a novel way to help robots better navigate environments using natural language instructions and scene descriptions in place of raw visual features.

Researchers from MIT CSAIL, the MIT-IBM Watson AI Lab and Dartmouth College created the LangNav method, which converts visual information into text captions that are then used to instruct robots on how to navigate environments.

In a recently published paper, the researchers suggested that their language-based approach outperformed traditional vision-based navigation methods, enabling improved task transfer abilities.

“We show that we can learn to navigate in real-world environments by using language as a perceptual representation,” the paper reads. “Language naturally abstracts away low-level perceptual details, which we find to be beneficial for efficient data generation and sim-to-real transfer.”

Training a robot to perform a task such as picking up an object typically requires considerable amounts of visual data to provide it with instructions.

The researchers propose that language could serve as a viable alternative to visual information, generating trajectories that guide a robot to its goal.

Instead of directly using raw visual observations, the researchers converted the visual inputs into text descriptions using off-the-shelf computer vision models for image captioning (BLIP) and object detection (Deformable DETR). 
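As a rough illustration of that first step, the sketch below captions a single image with the off-the-shelf BLIP model the paper names. The checkpoint name and image file are placeholder assumptions, not the researchers' exact configuration.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Off-the-shelf BLIP captioning model (checkpoint chosen for illustration).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# A single view from the robot's camera (placeholder file name).
image = Image.open("robot_view.jpg").convert("RGB")

# Convert the visual observation into a short text description.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)  # e.g. "a living room with a couch and a doorway"
```

In the full system, captions like this (plus detected objects) stand in for the raw pixels the navigation policy would otherwise have to process.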


These natural-language descriptions of the visual scenes were then fed into a large pre-trained language model fine-tuned for navigation tasks.

The resulting method generated text-based instructions for a robot, offering detailed guidance on how to navigate a specific path. For example: “Go down the stairs and straight into the living room. In the living room walk out onto the patio. On the patio stop outside the doorway.”
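Conceptually, the language model is prompted with the navigation instruction and the text descriptions of what the robot currently sees, and asked which way to move next. The sketch below shows that idea with a small open checkpoint standing in for the fine-tuned model; the prompt format and the choose_action helper are illustrative assumptions, not LangNav's actual setup.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Small open model used as a stand-in; the paper fine-tunes a much larger LLM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def choose_action(instruction: str, scene_captions: list[str]) -> str:
    """Ask the language model which candidate heading to move toward next."""
    prompt = (
        f"Instruction: {instruction}\n"
        "Current views:\n"
        + "\n".join(f"- heading {i}: {c}" for i, c in enumerate(scene_captions))
        + "\nNext action:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=20,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Return only the newly generated tokens after the prompt.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(choose_action(
    "Go down the stairs and straight into the living room.",
    ["a staircase leading down", "a hallway with a closed door", "a living room with a couch"],
))
```

Because every observation is plain text, the same prompt format works whether the captions come from simulated or real-world images, which is what the researchers credit for the method's sim-to-real transfer.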

By representing the visual scene through language, the method helps a robot better understand the path it needs to take while processing less information.

The paper suggests the LangNav approach outperformed traditional robotic navigation methods that rely solely on visual information.

The language-based approach even worked well in low-data settings, where only a few expert navigation examples were available for training.

“Our approach is found to improve upon baselines that rely on visual features in settings where only a few expert trajectories are available, demonstrating the potential of language as a perceptual representation for navigation,” the paper reads.


While the researchers described the approach as promising, they noted that LangNav has limitations. Some visual information is lost when a scene is converted into text, which can hamper the model's ability to understand the scene as a whole.


About the Author

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.

