Want Better Image Models? More Compute is All You Need

New Google DeepMind research shows that compute is the key to scaling computer vision models, even when using CNNs

Ben Wodecki, Jr. Editor

October 27, 2023

Illustration of a magnifying glass on a digital terrain
AI Business via DALL-E 3

At a Glance

  • ConvNets can match vision transformers for computer vision with enough data and compute.
  • Meta Chief AI Scientist Yann LeCun says 'Compute is all you need.'

Convolutional neural networks (CNNs) have long been among the dominant model architectures for computer vision tasks like image classification. But interest has been rising in transformer-based models called Vision Transformers (ViTs), which appear to scale better in both accuracy and computational efficiency.

Now, researchers from Google DeepMind argue that both architectures achieve the same results given equal resources – what matters is the amount of compute used to train the model.

That means companies with computer vision needs do not need to switch to the ViT architecture to get state-of-the-art accuracy. Instead, with enough data and compute, CNN performance improves predictably, and investing in larger models and training infrastructure will pay off.

In a paper titled 'ConvNets Match Vision Transformers at Scale,' Google DeepMind researchers found that CNNs can achieve the same results as ViTs simply by using more compute. They took a CNN architecture called NFNet and trained it on a massive dataset of four billion images – using up to 110,000 hours of compute on Google's TPU chips.

The resulting model matched the accuracy of existing ViT systems that used similar training compute.

Yann LeCun, Meta's chief AI scientist and Turing Award winner, said in a post on X (Twitter) that the findings show "compute is all you need," and that both CNNs and ViTs "have a role to play."


What does this mean?

The researchers suggest that the choice of architecture for computer vision use cases is not black and white – CNNs remain a strong option, since they can match ViTs given sufficient resources.

The researchers found that as they increased the compute budget for pre-training the NFNet models, performance on the validation set improved following what's known as a log-log scaling law: plotted on logarithmic axes, compute and validation loss fall on a straight line, meaning that multiplying the training budget by a fixed factor cuts the loss by a predictable proportion. This makes scaling efficient and predictable for model builders.

The researchers found that as they increased the compute budget exponentially (horizontal axis), the validation loss decreased linearly (vertical axis).

In simpler terms – the more compute CNN developers invest in training, the greater the predictable gains in model accuracy, without the returns flattening out.
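The scaling behavior described above can be sketched numerically. The snippet below is a minimal illustration, not the paper's actual data or code: the compute budgets and loss values are made up so that loss follows a power law in compute, which is exactly what a straight line on log-log axes implies. Fitting a line in log space recovers the scaling exponent and lets you extrapolate to a larger budget.

```python
import numpy as np

# Hypothetical compute budgets (e.g., TPU-core hours) and validation losses.
# Illustrative values only -- not measurements from the DeepMind paper.
compute = np.array([1e3, 1e4, 1e5, 1e6])
val_loss = np.array([2.0, 1.5, 1.125, 0.84375])  # loss shrinks by 0.75x per 10x compute

# A log-log scaling law means log(loss) is linear in log(compute):
#   loss ~ a * compute**(-b)
# so we fit a straight line in log space.
slope, intercept = np.polyfit(np.log10(compute), np.log10(val_loss), 1)
b = -slope           # scaling exponent (positive: loss falls as compute grows)
a = 10.0 ** intercept  # prefactor

# Extrapolate: predicted loss at a 10x larger budget than we measured.
predicted = a * (1e7) ** (-b)
print(f"exponent b = {b:.3f}, predicted loss at 1e7: {predicted:.4f}")
```

Because the toy data sits exactly on a power law, the fit predicts the next point precisely (0.75 × 0.84375 ≈ 0.6328); real training runs would show scatter around the fitted line.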

"The most important factors determining the performance of a sensibly designed model are the compute and data available for training," according to the paper. "Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs outperform pre-trained ConvNets when evaluated fairly."

About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.
