October 27, 2023
At a Glance
- ConvNets can match vision transformers for computer vision with enough data and compute.
- Meta Chief AI Scientist Yann LeCun says 'Compute is all you need.'
Convolutional neural networks (CNNs) have been among the dominant model architectures for computer vision tasks like image classification. But interest has been rising in transformer-based models called Vision Transformers (ViTs), which are widely believed to scale better, delivering higher accuracy for a given amount of training compute.
Now, researchers from Google DeepMind argue that both architectures can achieve the same results, and that what matters most is the amount of compute used to train the model.
That means companies with computer vision needs do not need to switch to the ViT architecture to get state-of-the-art accuracy. Instead, with enough data and compute, CNN performance improves predictably, so investing in larger models and training infrastructure pays off.
In a paper titled "ConvNets Match Vision Transformers at Scale," Google researchers found that CNNs can achieve the same results as ViTs simply by using more compute. They took a CNN architecture called NFNet and trained it on a massive dataset of four billion images, using up to 110,000 hours of compute on Google's TPU chips.
The resulting model matched the accuracy of existing ViT systems that used similar training compute.
Yann LeCun, Meta’s chief AI scientist and Turing award winner, said in a post on X (Twitter) that the findings show “compute is all you need,” and that both CNNs and ViTs “have a role to play.”
What does this mean?
The researchers propose that the choice of architecture for computer vision use cases is not clear-cut, and that CNNs remain a strong option because they can match ViTs given sufficient data and compute.
The researchers found that as they increased the compute budget for pre-training the NFNet models, performance on the validation set improved following what's known as a log-log scaling law: plotted on logarithmic axes, validation loss falls along a straight line as compute grows.
Concretely, each time the compute budget increased by a constant factor (horizontal axis), validation loss dropped by a roughly constant amount (vertical axis).
In simpler terms, the more compute CNN developers spend on training, the more predictably model quality improves, though each equal-sized gain requires an exponentially larger compute budget.
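A log-log scaling law of this kind is a power law. The sketch below illustrates the shape of the relationship with made-up coefficients; the values of `a` and `b` are purely illustrative and are not taken from the paper.

```python
import math

# Hypothetical power-law scaling: validation loss L(C) = a * C**(-b),
# so log(L) falls linearly as log(C) grows -- a straight line on
# log-log axes. The coefficients a and b are illustrative only.
a, b = 5.0, 0.1

def validation_loss(compute: float) -> float:
    """Validation loss as a function of pre-training compute (arbitrary units)."""
    return a * compute ** (-b)

# Each 10x increase in compute lowers log10(loss) by the same fixed
# amount (b), which is what "linear on log-log axes" means.
for compute in (1e3, 1e4, 1e5, 1e6):
    loss = validation_loss(compute)
    print(f"C={compute:.0e}  loss={loss:.3f}  log10(loss)={math.log10(loss):+.3f}")
```

Each tenfold jump in compute shifts `log10(loss)` down by exactly `b = 0.1`, so improvements are predictable, but every equal improvement costs ten times more compute than the last.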
"Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs outperform pre-trained ConvNets when evaluated fairly,” according to the paper.
“The most important factors determining the performance of a sensibly designed model are the compute and data available for training. Although the success of ViTs in computer vision is extremely impressive, in our view there is no strong evidence to suggest that pre-trained ViTs outperform pre-trained ConvNets when evaluated fairly.”