‘Once-for-all’ approach involves a single neural network trained for deployments on thousands of platforms
Researchers at MIT have developed an optimization strategy for deep neural networks (DNNs) that automatically prepares them for deployment on thousands of different edge devices, reducing the carbon footprint of both training and inference.
According to a paper presented at the International Conference on Learning Representations (ICLR) 2020, the problem with conventional approaches to DNNs is that they require data scientists either to design a specialized neural network manually or to find one using neural architecture search (NAS), and then to “train it from scratch for each case” (emphasis by the authors), i.e. for each target hardware platform.
A 2019 study from the University of Massachusetts at Amherst found that training a single large (213 million parameters) Transformer-based neural network using NAS, the kind that is frequently used for machine translation, is responsible for around 626,000 pounds of carbon dioxide emissions, as much as five cars would produce in their lifetimes.
The authors of the paper say the approach, dubbed ‘once-for-all’ or OFA, reduces the number of GPU hours required to train certain types of models by “orders of magnitude” while maintaining similar or even higher levels of accuracy. Fewer GPU hours means the process consumes less electricity and, as a result, produces lower carbon emissions.
Efficiency is good
“We address the challenging problem of efficient inference across many devices and resource constraints, especially on edge devices,” states the paper from the MIT-IBM Watson AI Lab.
“For instance, one mobile application on App Store has to support a diverse range of hardware devices, from a high-end Samsung Note10 with a dedicated neural network accelerator to a five-year-old Samsung S6 with a much slower processor. With different hardware resources (e.g., on-chip memory size, #arithmetic units), the optimal neural network architecture varies significantly. Even running on the same hardware, under different battery conditions or workloads, the best model architecture also differs a lot.”
With the OFA approach, it’s possible to create a specialized sub-network for a particular device from the main network without additional training. The authors say that models created using OFA perform better on edge devices than state-of-the-art NAS-created models, yielding 1.5x-2.6x improvements in internal tests.
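To make the idea of carving a device-specific sub-network out of one trained network more concrete, here is a minimal Python sketch. It is not the authors' code: the `SubnetConfig` fields, the `estimated_cost` formula, the weight-slicing rule, and the use of cost as a stand-in for accuracy are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SubnetConfig:
    """Hypothetical description of one sub-network inside the big network."""
    depth: int   # number of layers kept
    width: int   # number of channels kept per layer
    kernel: int  # convolution kernel size


def estimated_cost(cfg):
    """Toy cost proxy (assumption): grows with depth, width and kernel area."""
    return cfg.depth * cfg.width * cfg.kernel ** 2


def pick_subnet(budget, depths, widths, kernels):
    """Return the costliest configuration that still fits the device budget.

    Assumption: a larger sub-network is treated as the more accurate one.
    Returns None when nothing fits.
    """
    candidates = [SubnetConfig(d, w, k) for d, w, k in product(depths, widths, kernels)]
    feasible = [c for c in candidates if estimated_cost(c) <= budget]
    return max(feasible, key=estimated_cost) if feasible else None


def slice_weights(full_weights, cfg):
    """Reuse the big network's weights: keep the first layers and channels.

    No retraining happens here; the sub-network simply shares a slice of the
    already-trained parameters.
    """
    return [layer[:cfg.width] for layer in full_weights[:cfg.depth]]


# Usage: a pretend 4-layer network with 64 channels per layer.
full = [list(range(64)) for _ in range(4)]
cfg = pick_subnet(1000, depths=(2, 3, 4), widths=(16, 32, 64), kernels=(3, 5, 7))
small = slice_weights(full, cfg)
```

A real system would replace the toy cost function with measured latency or energy on the target device, and would estimate accuracy rather than assume that bigger is better.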
As part of the architecture, the authors propose a progressive shrinking algorithm that they say can reduce model size more effectively than conventional network pruning.
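The article does not spell out how progressive shrinking works, so the following Python sketch only illustrates the general notion of shrinking in stages rather than pruning in one shot. The function name `shrinking_schedule`, the stage order (kernel size, then depth, then width) and the example values are assumptions made for illustration, not details taken from this article.

```python
def shrinking_schedule(kernels, depths, widths):
    """Build a list of training stages.

    Stage 0 allows only the largest setting of every dimension. Each later
    stage additionally allows one smaller option, so the set of supported
    sub-networks grows gradually instead of all at once.
    """
    allowed = {
        "kernel": [max(kernels)],
        "depth": [max(depths)],
        "width": [max(widths)],
    }
    stages = [{name: list(values) for name, values in allowed.items()}]
    for name, options in (("kernel", kernels), ("depth", depths), ("width", widths)):
        # Skip the largest option (already allowed), then add smaller ones in turn.
        for value in sorted(options, reverse=True)[1:]:
            allowed[name].append(value)
            stages.append({n: list(v) for n, v in allowed.items()})
    return stages


# Usage with illustrative values.
stages = shrinking_schedule(kernels=(3, 5, 7), depths=(2, 3, 4), widths=(3, 4, 6))
for index, stage in enumerate(stages):
    print(index, stage)
```

In an actual training loop, each stage would fine-tune the shared weights while sampling sub-networks from the options allowed at that stage.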
In 2019, the team behind OFA won both competitions at the 4th Low Power Computer Vision Challenge, an annual event held by the IEEE that aims to improve the energy efficiency of computer vision on systems with stringent resource constraints.
“These teams’ solutions outperform the best solutions in literature,” the organizers admitted at the time.