Google, MIT's SynCLR: Model Training Using Only Synthetic Data
Using AI models like Meta’s Llama 2, OpenAI’s GPT-4 and Stable Diffusion, Google and MIT scientists created a hefty dataset made up of synthetic images
At a Glance
- SynCLR from Google and MIT introduces a novel approach to AI model training using only synthetic data.
Researchers from Google and MIT developed a new approach to training AI image models using only synthetic data to reduce laborious dataset gathering.
SynCLR trains vision models to learn visual representations using only synthetic images and captions, according to a recently published paper.
The researchers used the seven billion parameter version of Meta’s Llama 2 to generate image captions. OpenAI’s GPT-4 was then used to create a list of suitable backgrounds for the chosen concepts in a bid to improve the plausibility of the caption scenarios.
The AI-generated captions were then compiled and used to train Stable Diffusion. The image generation model was tasked with creating images that correspond to each synthetic caption.
The resulting images and captions were compiled to create a dataset suitable for training visual representation models. The dataset, aptly named SynCaps-150M, contains 150 million systematically generated captions and images.
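The three-stage pipeline described above can be sketched in Python. This is a hypothetical illustration, not the researchers' actual code: the function names, prompts, and the stubbed model calls (which stand in for Llama 2, GPT-4, and Stable Diffusion) are all assumptions made for clarity.

```python
import random

def curate_backgrounds(concept):
    # Stage 1 (GPT-4 in the paper): propose plausible backgrounds for a
    # concept instead of sampling them at random. Stubbed with a lookup.
    plausible = {"golden retriever": ["park", "living room", "beach"]}
    return plausible.get(concept, ["outdoor scene"])

def generate_caption(concept, background):
    # Stage 2 (Llama 2-7B in the paper): expand a concept plus a background
    # into a full image caption. Stubbed here with a simple template.
    return f"a photo of a {concept} in a {background}"

def synthesize_image(caption):
    # Stage 3 (Stable Diffusion in the paper): render the caption as an
    # image. Stubbed to return a placeholder caption-image pair.
    return {"caption": caption, "image": f"<image for: {caption}>"}

def build_synthetic_dataset(concepts, per_concept=2, seed=0):
    # Compile caption-image pairs, the shape of a SynCaps-style dataset.
    random.seed(seed)
    dataset = []
    for concept in concepts:
        backgrounds = curate_backgrounds(concept)
        for _ in range(per_concept):
            background = random.choice(backgrounds)
            caption = generate_caption(concept, background)
            dataset.append(synthesize_image(caption))
    return dataset

pairs = build_synthetic_dataset(["golden retriever"], per_concept=2)
print(len(pairs))  # 2 caption-image pairs
```

Scaled up to 150 million pairs with real models in each stage, this is the general shape of the data-curation loop the paper describes.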
The SynCaps-150M dataset, however, is not available at the time of writing, as the researchers are awaiting the result of an internal approval process. The generated images are also not available, with the researchers posting on GitHub: “We will try to see if we can release them.”
Produce ‘infinite’ examples
Using synthetic data created by LLMs is not a new concept. OpenAI’s DALL-E 3 image generation model is largely built on synthetic data. But Google and MIT scientists wrote that the SynCLR approach of utilizing multiple systems to create varying layers of data – from initial captions to backgrounds and then the images themselves – enhances the overall quality of the synthetic dataset.
Using multiple AI systems yielded better data, with the authors noting that Meta’s Llama performed poorly when prompted to combine concepts with randomly sampled backgrounds. By bringing in GPT-4 to suggest suitable backgrounds instead, the plausibility of the resulting captions and images improved.
Building AI systems and gathering the necessary data to build out the underlying components of a model takes time and a lot of money due to computational costs. Leveraging a synthetic approach, where an off-the-shelf system or low-parameter open source model generates the relevant data, could save developers money.
By reducing the dependency on real-world data, SynCLR could also help developers prevent biases prevalent in real-world image sets from creeping into their models.
“These models provide the flexibility to produce an infinite number of samples (albeit finite diversity) and control the generation process through textual input,” the paper reads. “Generative models offer a convenient and effective method for curating training data.”
SynCLR is not MIT and Google’s first foray into synthetic data generation. Back in November 2023, they created StableRep, which uses AI-generated images to train AI models. StableRep produced highly detailed images, but the overall process was slow, which in turn could increase a user’s computation costs.
Comparable with state-of-the-art systems
In terms of performance, the researchers used the dataset to train ViT-B and ViT-L models. In the resulting tests, the models performed well when compared with other visual representation learners like OpenAI’s CLIP and DINOv2. In dense prediction tasks like semantic segmentation, SynCLR even outperformed other self-supervised methods like StableRep.
To improve its performance in the future, the authors suggest adding datasets that bring in concepts not present on the initial list. Utilizing more advanced LLMs, such as a larger-parameter model than Llama 2-7B, could also produce an even richer set of captions, the paper states.
The researchers concluded by saying their work “studies a new paradigm for visual representation learning – learning from generative models.
“Without using any real data, SynCLR learns visual representations that are comparable with those achieved by state-of-the-art general-purpose visual representation learners.”