January 15, 2021
Currently more of a research project than a commercial product
Google Brain has developed an artificial intelligence language model with some 1.6 trillion parameters.
That makes it roughly nine times the size of OpenAI's 175 billion parameter GPT-3, previously considered to be the world's largest language model.
While it gives some indication of the project’s scale, the models are too different in architecture for a meaningful ‘apples-to-apples’ comparison.
I like big parameters and I cannot lie
Language models estimate how likely a sentence is to occur in the real world: for example, "I took my friend for dinner" would be rated as more likely than "I took my wall for dinner."
The larger the dataset, the better the chance of the AI-generated sentence being coherent and appearing to be authored by a human.
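To make the idea of sentence likelihood concrete, here is a minimal sketch of a toy bigram model in Python. It is not how GPT-3 or Google's model works internally; the four-sentence corpus, the function names, and the add-one smoothing are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical four-sentence corpus; real models train on billions of words.
CORPUS = [
    "i took my friend for dinner",
    "i took my dog for a walk",
    "my friend took me for dinner",
    "i painted my wall",
]

def train_bigram_model(sentences):
    """Count context words and adjacent word pairs, adding start/end markers."""
    contexts, pairs, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        pairs.update(zip(tokens[:-1], tokens[1:]))
    return contexts, pairs, len(vocab)

def log_likelihood(sentence, contexts, pairs, vocab_size):
    """Sum of log P(word | previous word) over the sentence."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        # Add-one smoothing keeps unseen pairs from getting zero probability.
        total += math.log((pairs[(prev, word)] + 1) / (contexts[prev] + vocab_size))
    return total

contexts, pairs, vocab_size = train_bigram_model(CORPUS)
friend_score = log_likelihood("I took my friend for dinner", contexts, pairs, vocab_size)
wall_score = log_likelihood("I took my wall for dinner", contexts, pairs, vocab_size)
print(friend_score > wall_score)
```

Because "my friend" and "friend for" appear in the toy corpus more often than "my wall" and "wall for," the friend sentence receives the higher score. Large neural models replace the word-pair counts with billions of learned parameters, but the goal of ranking plausible text above implausible text is the same.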
Both the new Google Brain model and OpenAI's GPT-3 are transformer-based neural networks, which are designed to handle sequential data such as languages.
They have proved exceptionally popular in recent years as they allow for more parallelization than the previously leading recurrent neural networks (RNNs). Unlike RNNs, transformers do not need to process the beginning of a sentence before the end, which massively reduces training times and cost.
But cost is still a huge issue when working at the scale of the language models proposed by Google, OpenAI, and companies like Microsoft. In its research paper, "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity," Google Brain focuses on trying to increase the parameter count without exponentially increasing training costs.
To do this, Google Brain turned to another rapidly developing AI field – sparse training. They used an approach called 'mixture of experts,' in which multiple experts (essentially smaller models within the larger model) each handle a smaller region of the wider dataset.
This builds upon work Google revealed in 2017, when the company introduced the concept of a Sparsely-Gated Mixture-of-Experts Layer (MoE).
"The MoE consists of a number of experts, each a simple feed-forward neural network, and a trainable gating network which selects a sparse combination of the experts to process each input," the researchers said at the time. "All parts of the network are trained jointly by back-propagation."
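The layer the researchers describe can be sketched in a few dozen lines of plain Python. This is an illustrative toy under stated assumptions, not Google's code: the helper names, the random weights, and the tiny dimensions are hypothetical, and the noise term and load-balancing losses used in practice are left out.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def make_expert(rng, dim, hidden):
    """A simple feed-forward network: dim -> hidden (ReLU) -> dim."""
    w_in = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.gauss(0, 0.5) for _ in range(hidden)] for _ in range(dim)]

    def expert(x):
        h = [max(0.0, v) for v in matvec(w_in, x)]
        return matvec(w_out, h)

    return expert

def moe_forward(x, experts, gate_weights, k=2):
    """Send x to the k highest-scoring experts and mix their outputs."""
    logits = matvec(gate_weights, x)  # one gating score per expert
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])  # normalize over chosen experts only
    output = [0.0] * len(x)
    for weight, index in zip(weights, top):
        y = experts[index](x)  # the other experts are never evaluated
        output = [o + weight * v for o, v in zip(output, y)]
    return output, top

rng = random.Random(0)
DIM, HIDDEN, NUM_EXPERTS = 4, 8, 8  # hypothetical toy sizes
experts = [make_expert(rng, DIM, HIDDEN) for _ in range(NUM_EXPERTS)]
gate_weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
output, chosen = moe_forward([0.5, -1.0, 0.25, 2.0], experts, gate_weights, k=2)
print("experts used:", chosen)
```

Only two of the eight experts do any work for this input, which is the "sparse combination" in the quote. In a real system the gating weights and expert weights would be learned together by back-propagation rather than drawn at random.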
To further improve the efficiency, the team turned to a new concept called Switch Transformer. Detailed in the new paper, it simplifies and improves standard MoE approaches.
"The guiding design principle for Switch Transformers is to maximize the parameter count of a Transformer model in a simple and computationally efficient way," the researchers explained.
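One way to picture that principle is a router that sends each token to a single expert, so adding experts adds parameters without adding much work per token. The sketch below assumes this top-1 routing and uses hypothetical layer sizes and stand-in experts; it ignores biases, attention layers, and the details of distributed training.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(router_logits):
    """Pick the single most probable expert (top-1 routing)."""
    probs = softmax(router_logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs[best]

def switch_layer(x, router_logits, experts):
    """Run x through one expert and scale the result by the router probability."""
    index, gate = switch_route(router_logits)
    return [gate * v for v in experts[index](x)], index

def ffn_params(dim, hidden):
    """Weights in one dim -> hidden -> dim feed-forward expert (biases ignored)."""
    return 2 * dim * hidden

def layer_param_counts(dim, hidden, num_experts):
    """Return (total parameters stored, parameters used for any one token)."""
    router = dim * num_experts
    total = num_experts * ffn_params(dim, hidden) + router
    active = ffn_params(dim, hidden) + router
    return total, active

# Stand-in experts that just scale their input; real experts are neural networks.
toy_experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]
routed, expert_index = switch_layer([1.0, -2.0], [0.1, 2.0, -1.0], toy_experts)
print("token routed to expert", expert_index)

# Hypothetical sizes, chosen only to show the trend.
for n in (1, 8, 64):
    total, active = layer_param_counts(dim=1024, hidden=4096, num_experts=n)
    print(n, "experts:", total, "stored,", active, "used per token")
```

In this toy accounting the stored parameter count grows almost linearly with the number of experts, while the parameters touched per token grow only by the small router term.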
Google Brain found that this approach could scale, training models ranging from hundreds of billions of parameters all the way up to 1.6 trillion without encountering any serious instability.
The model can run in full on supercomputers, or be distilled into small, dense versions for devices with only a few computational cores. "We reduce the model size by up to 99 percent while preserving 30 percent of the quality gains," the researchers said – noting that the model was built with Google's custom TPU chips in mind.
"Switch Transformers are scalable and effective natural language learners," the team concluded. "We find that these models excel across a diverse set of natural language tasks and in different training regimes, including pre-training, fine-tuning and multi-task training."
The 1.6 trillion parameter model appears to be less polished than GPT-3, which is open to researchers. The two platforms have not been benchmarked against each other, and Google's work has not been independently verified.
However, the company has released the code for the Switch Transformer that made its giant model possible – available on GitHub.