Meta’s Chameleon AI Model Seamlessly Handles Text and Images

Meta’s new model handles text and images equally, enabling it to generate high-quality visuals and text with ease

Ben Wodecki, Jr. Editor

May 24, 2024

3 Min Read
Image: A chameleon in the colors of Meta Platforms clinging to a tree branch (AI Business via DALL·E 3)

Meta has unveiled a family of multimodal AI models that seamlessly integrate visual and textual information.

Developed by Meta’s Fundamental AI Research (FAIR) team, Chameleon is designed to perform a range of tasks, including answering questions about visuals and generating image captions.

The models can perform a broad range of multimodal tasks, achieving state-of-the-art performance on image captioning while handling text and visual data on an equal footing.

Chameleon can generate text responses and images with a single model. By contrast, other AI systems call on separate models for certain tasks, the way ChatGPT uses DALL-E 3 to generate its images.

For example, the Chameleon models can create an image of an animal, like a bird, and answer user questions about a particular species.

The Chameleon models outperform Llama 2 and are competitive with models such as Mistral’s Mixtral 8x7B and Google’s Gemini Pro.

Chameleon even keeps pace with larger-scale systems like OpenAI’s GPT-4V.

Its capabilities could power multimodal features in Meta AI, the chatbot recently rolled out across Meta’s social media apps, including Facebook, Instagram and WhatsApp.

Meta currently uses Llama 3 to power Meta AI but could follow ChatGPT’s lead and use multiple underlying systems for different tasks, such as better answering user queries about photos on Instagram.

Related: OpenAI Unveils New Model, Widens Access to ChatGPT Tools

“Chameleon unlocks entirely new possibilities for multimodal interaction[s],” the researchers wrote.

Meta’s Chameleon follows the unveiling of another multimodal AI model, OpenAI’s GPT-4o, which is being used to power ChatGPT’s new visual capabilities.

Architecture Tweaks Improve Mixed Modality Handling

The Chameleon models use a combination of architectural changes and novel training techniques.

Under the hood, the Chameleon models use an architecture that largely follows Llama 2. However, Meta’s researchers tweaked the underlying transformer architecture to ensure the model performed well when handling mixed modalities.

Those changes include query-key normalization and a revised placement of layer norms.
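
Query-key normalization generally means applying a layer norm to the query and key vectors inside each attention block, which helps keep attention scores stable when text and image tokens share one sequence. The following is a minimal PyTorch-style sketch of that idea, written for illustration only; the module names, shapes and defaults are assumptions, not Meta’s actual Chameleon code.

    # Illustrative sketch of query-key (QK) normalization in self-attention.
    # Names, shapes and defaults are assumptions, not Meta's Chameleon code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QKNormAttention(nn.Module):
        def __init__(self, dim: int, n_heads: int):
            super().__init__()
            self.n_heads = n_heads
            self.head_dim = dim // n_heads
            self.qkv = nn.Linear(dim, 3 * dim, bias=False)
            self.proj = nn.Linear(dim, dim, bias=False)
            # Layer norms applied to queries and keys per head ("QK-norm"),
            # intended to keep attention logits in a stable range when
            # text and image tokens are mixed in one sequence.
            self.q_norm = nn.LayerNorm(self.head_dim)
            self.k_norm = nn.LayerNorm(self.head_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, t, d = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # Reshape to (batch, heads, tokens, head_dim).
            q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            # Normalize queries and keys before attention scores are computed.
            q, k = self.q_norm(q), self.k_norm(k)
            out = F.scaled_dot_product_attention(q, k, v)
            return self.proj(out.transpose(1, 2).reshape(b, t, d))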

They also used two tokenizers, one for text and one for images, which convert both kinds of input into tokens that are combined into a single sequence. Outputs are handled the same way, so the one model reads and generates text and image tokens in the same stream.
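
In rough terms, both modalities end up as discrete tokens in one sequence before the transformer ever sees them, an approach often described as early fusion. Below is a deliberately simplified sketch of that flow; the tokenizer functions, vocabulary sizes and token counts are invented for illustration and are not Meta’s implementation.

    # Simplified sketch of early fusion: text and images are tokenized
    # separately, then concatenated into a single discrete token sequence
    # that one transformer processes end to end. All names and sizes below
    # are illustrative assumptions.
    from typing import Any, List, Tuple

    TEXT_VOCAB = 50_000            # assumed size of the text vocabulary
    IMAGE_CODES_PER_IMAGE = 1_024  # assumed number of codes per image

    def encode_text(text: str) -> List[int]:
        # Stand-in for a real BPE tokenizer: one token id per word.
        return [hash(word) % TEXT_VOCAB for word in text.split()]

    def encode_image(image: Any) -> List[int]:
        # Stand-in for a learned image tokenizer that maps an image to a
        # fixed number of discrete codebook ids, offset past the text vocab.
        return [TEXT_VOCAB + (i % 8_192) for i in range(IMAGE_CODES_PER_IMAGE)]

    def build_sequence(segments: List[Tuple[str, Any]]) -> List[int]:
        """segments is a list of ("text", str) or ("image", image) pairs.
        Both modalities are mapped to token ids and joined into one
        sequence, so a single model can attend over and generate both."""
        tokens: List[int] = []
        for kind, content in segments:
            tokens += encode_text(content) if kind == "text" else encode_image(content)
        return tokens

Because outputs share that same token space, generation works in reverse: the model emits a mix of text and image tokens, and the image tokens are decoded back into pixels.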

Through their changes, the researchers were able to train the model on five times the number of tokens used to train Llama 2, despite Chameleon being under half the size at 34 billion parameters.

Related: Meta Unveils Llama 3, the Most Powerful Open Source Model Yet

The researchers said the techniques used to develop Chameleon could enable scalable training of token-based AI models.

“Chameleon represents a significant step towards realizing the vision of unified foundation models capable of flexibly reasoning over and generating multimodal content,” Meta’s researchers wrote.


About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.

