7 language models you need to know

Parameter size is not always an indication of quality. (Second of a 2-part series.)

Ben Wodecki

July 27, 2022

There are numerous language models out there performing tasks ranging from the incredibly simple to the extremely complex.

While the model landscape can be daunting at times, AI Business is here to help with a list of arguably the seven models with the biggest impact on the field.

1. GPT-3

Developers: OpenAI

Parameters: 175 billion

The darling of the language model world, GPT-3 uses deep learning to produce human-like text.

Initially released in 2020, this model was trained using a method called generative pretraining, which essentially means GPT-3 was taught to predict the next token in a sequence of text.
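
To make the idea of next-token prediction concrete, here is a minimal sketch. It is a toy bigram counter, not GPT-3's transformer, and the function names and the tiny corpus are invented purely for illustration. It shows only the training objective in miniature: given what came before, guess what comes next.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus; real models train on hundreds of billions of tokens.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
print(predict_next(model, "the"))  # prints "cat" (seen twice, versus "mat" once)
```

A model like GPT-3 replaces the lookup table with a neural network that conditions on a long context rather than a single previous word, but the prediction target is the same.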

Microsoft invested $1 billion into GPT-3’s developer, OpenAI, in 2019 and holds an exclusive license to use the model. Developers can still use the public API, but only Microsoft has access to GPT-3's underlying model.

Unlike some of the newer models on this list, GPT-3 has been applied in a plethora of cases. Here are some examples:

Playwright: GPT-3 was used by a theatre group in the U.K. to write a play. Summer 2021 saw a production at the Young Vic theatre in London that was 'written' by the model. Throughout a three-day performance, writers put prompts into the system, which then generated a story. The actors would then adapt lines to improve the narrative and feed further prompts to guide the story’s progression.

Phishing: Researchers from Singapore's Government Technology Agency used GPT-3 to generate phishing emails in a bid to attract unwitting users. The AI-generated emails included highly specific details, such as references to Singapore law, when it was prompted to develop content for residents. The study was conducted to showcase how language models could be used for nefarious purposes.

Dungeon master: GPT-3 was used in AI Dungeon, a text-based adventure game similar to Zork. The AI model generates content, allowing players to create their own custom adventures. At times, the GPT-3 version of the game would develop inappropriate graphic and sexual content despite not being prompted by players. The model was replaced in late 2021 after OpenAI changed its policy regarding generated content.

Copywriter: The Guardian used the GPT-3 model to write an article. The model was fed ideas and produced eight different essays, which editors then merged into one.

Other applications: Fable Studios uses GPT-3 to create characters for VR experiences. Web search startup Algolia taps it to improve its products. And Create Labs is making use of GPT-3 to enhance its social venture projects.

2. Bloom

Developers: Hugging Face, BigScience

Parameters: 176 billion

One of the newest models on this list, BLOOM is an open source model developed by a consortium of more than 1,000 AI researchers who set out to create a multilingual language model.

BLOOM, or BigScience Large Open-science Open-access Multilingual Language Model, can generate text in 46 natural languages and 13 programming languages. It is the first time languages such as French and Arabic are represented in a language model with more than 100 billion parameters.

The model can be accessed and used on a local machine or in the cloud. For researchers who do not have access to large servers, an inference API for large-scale use without dedicated hardware or engineering is set to be released shortly.

To access BLOOM, users must agree to a license banning its application in several restricted cases, including generating false information to harm others and automating decision-making that harms someone’s legal rights.

Given BLOOM is open source, developers can now access and use a sizable language model previously reserved for private tech companies with deep pockets. According to CambrianAI analyst Alberto Romero, BLOOM will “break the stranglehold big tech has on the research and development of large language models.”

3. ESMFold

Developers: Meta AI

Parameters: 15 billion

The most recent model to be released on this list, ESMFold, or Evolutionary Scale Modeling, can accurately predict full atomic protein structures from a single sequence of a protein.

Predicting a protein’s 3D structure has the potential to speed up drug discoveries. Meta’s AI model aims to do this, with ESMFold boasting “high accuracy end-to-end atomic level structure prediction directly from the individual sequence of a protein.”

Meta’s AI researchers fed protein data into their language model to see if it would predict protein structures.

They found their model achieved similar accuracy to AlphaFold2 for sequences with low perplexity, according to a research paper covering the new model. “ESMFold inference is an order of magnitude faster than AlphaFold2, enabling exploration of the structural space of metagenomic proteins in practical timescales.”

And to challenge rival model AlphaFold2, Meta research engineer Zeming Lin said his team plans to open source ESMFold in the future.

A note about DeepMind’s AlphaFold:

Developed by the Google-owned DeepMind, AlphaFold has 21 million parameters and can predict protein structures. The deep learning system was trained on more than 170,000 proteins from a public repository of protein sequences and structures. AlphaFold uses an attention network, a deep learning technique in which an algorithm identifies parts of a larger problem and then pieces them together to obtain the overall solution.

The initial version of the model was released in late 2018, with a second version, AlphaFold 2, published in 2020. DeepMind went on to release the source code a year later.

4. Gato

Developers: DeepMind

Parameters: 79 million, 364 million and 1.18 billion

Arguably one of the most important models on this list, Gato is a “general purpose” system designed to take on many different tasks. Developed in three sizes, it differs from the other models here in the range of tasks it can undertake. Traditionally, most AI systems are taught one or two responsibilities.

Short for ‘a General Agent,’ Gato can play Atari, caption images, chat, stack blocks with a real robot arm and more. The system can decide whether to output text, joint torques, button presses, or other tokens based on context.

How Gato works: Technobabble incoming

The model was trained on data covering different tasks and modalities. This data was serialized into a flat sequence of tokens, which was then batched and processed by a transformer neural network similar to a large language model. Upon deployment, a prompt is tokenized to form an initial sequence. The environment yields the first observation, which is also tokenized and appended to the sequence. Gato then samples the action vector autoregressively, one token at a time.

Once all tokens comprising the action vector have been sampled, the action is decoded and sent to the environment, which steps and yields a new observation. Then the procedure repeats. DeepMind’s researchers suggest the model “always sees all previous observations and actions within its context window of 1024 tokens.”
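
The loop described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not DeepMind's implementation: `ToyEnv`, `sample_next_token`, `run_episode`, the vocabulary size and the action length are all hypothetical, and the "model" simply picks random tokens. Truncating to the last 1,024 tokens is likewise an assumption about how the context window is applied.

```python
import random

CONTEXT_WINDOW = 1024  # context length quoted from the paper
ACTION_LEN = 2         # ASSUMPTION: tokens per action vector (toy value)
VOCAB_SIZE = 16        # ASSUMPTION: toy vocabulary size


class ToyEnv:
    """Hypothetical environment whose observations are one-token lists."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return [self.t % VOCAB_SIZE]

    def step(self, action):
        self.t += 1
        return [(self.t + sum(action)) % VOCAB_SIZE]


def sample_next_token(context, rng):
    """Stand-in for the transformer: a real model would condition on `context`."""
    return rng.randrange(VOCAB_SIZE)


def run_episode(prompt_tokens, env, steps, seed=0):
    rng = random.Random(seed)
    sequence = list(prompt_tokens)   # the tokenized prompt forms the initial sequence
    sequence += env.reset()          # first observation is appended
    actions = []
    for _ in range(steps):
        action = []
        for _ in range(ACTION_LEN):  # sample the action one token at a time
            context = sequence[-CONTEXT_WINDOW:]
            token = sample_next_token(context, rng)
            sequence.append(token)
            action.append(token)
        actions.append(action)        # "decoding" is trivial here: tokens are the action
        sequence += env.step(action)  # environment steps and yields a new observation
    return sequence, actions


sequence, actions = run_episode([1, 2, 3], ToyEnv(), steps=5)
print(len(sequence), len(actions))
```

The point of the sketch is the shape of the loop: observations and actions share one token sequence, and the same sampling routine produces whatever kind of output the context calls for.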

For a full explanation, check out DeepMind’s paper.

A step towards artificial general intelligence?

Gato’s impact on the AI world could be profound. At least, that is what DeepMind likely is hoping for.

The model was built on a sizable dataset comprising data from both simulated and real-world environments. It was also built using several natural language and image datasets.

But while it can perform a lot of tasks, it does not necessarily do them all well. For example, when generating dialogue, the model tends to generate “often superficial or factually incorrect” responses, according to the paper.

The model also struggles with memory constraints, which limit its ability to adapt to a new task by conditioning on a prompt, such as demonstrations of the desired behavior. But while there are kinks to iron out, Gato is certainly a step closer to the far-flung concept of general intelligence compared to other models on this list.

5. WuDao 2.0

Developers: Beijing Academy of Artificial Intelligence

Parameters: 1.75 trillion

The biggest model on this list and in the world, WuDao can simulate conversational speech, write poems and understand images.

The first iteration of WuDao was showcased in January 2021, with WuDao 2.0 unveiled just a few months later in May.

The model's architecture is comparable to GPT-3's. But WuDao blows GPT-3 out of the water with its size: a whopping 1.75 trillion parameters, making it the world’s largest language model. Comparatively, Google's Switch Transformer, announced in January 2021, featured 1.6 trillion parameters.

WuDao was trained on 4.9 terabytes of images and text, including 1.2 terabytes of Chinese text and 1.2 terabytes of English text. It is important to note that the size of a language model often does not correlate with its quality, and because WuDao is not a monolithic transformer model, a meaningful ‘apples-to-apples’ comparison is not possible.

Little is known, though, about exactly what made up the datasets used to train the latest version, or what applications the Beijing Academy of Artificial Intelligence intends for the model. One task WuDao can reportedly perform, however, is predicting the 3D structures of proteins, similar to ESMFold and AlphaFold, without being trained solely for that task.

6. MT-NLG

Developers: Nvidia, Microsoft

Parameters: 530 billion

Megatron-Turing Natural Language Generation, or MT-NLG, is the largest monolithic transformer-based language model. It can perform several natural language tasks, including natural language inference and reading comprehension.

The successor to Microsoft’s Turing NLG 17B and Nvidia’s Megatron-LM language models, MT-NLG can auto-complete sentences and perform reading comprehension and commonsense reasoning.

The model was trained on 15 datasets consisting of a total of 339 billion tokens from English-language websites. This was later whittled down to 270 billion tokens. Training ran on Nvidia’s Selene ML supercomputer, which comprises 560 DGX A100 servers, each containing eight A100 80GB GPUs.
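
For a sense of scale, the hardware figures above can be multiplied out. The short calculation below uses only the numbers quoted in this article, plus one labeled assumption: that each parameter is stored as a 16-bit value (2 bytes). The weights-only estimate ignores optimizer state, activations and everything else training requires, so treat it as a lower bound, not a description of how MT-NLG was actually partitioned.

```python
import math

SERVERS = 560                  # DGX A100 servers, as stated above
GPUS_PER_SERVER = 8            # A100 GPUs per server, as stated above
GPU_MEMORY_GB = 80             # memory per GPU, as stated above
PARAMETERS = 530_000_000_000   # MT-NLG parameter count, as stated above
BYTES_PER_PARAMETER = 2        # ASSUMPTION: 16-bit weights

total_gpus = SERVERS * GPUS_PER_SERVER
total_memory_gb = total_gpus * GPU_MEMORY_GB

# Memory needed just to hold the weights, in decimal gigabytes.
weights_gb = PARAMETERS * BYTES_PER_PARAMETER / 1e9
min_gpus_for_weights = math.ceil(weights_gb / GPU_MEMORY_GB)

print(total_gpus)            # 4480
print(total_memory_gb)       # 358400
print(weights_gb)            # 1060.0
print(min_gpus_for_weights)  # 14
```

Even under that optimistic assumption, the weights alone would not fit on a single eight-GPU server, which is part of why models of this size are trained across many machines.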

The model can perform a broad set of language tasks with high quality, with the partner companies suggesting MT-NLG has the potential to “shape tomorrow’s products and motivate the community to push the boundaries of natural language processing even further.”

7. LaMDA

Developer: Google

Parameters: 137 billion

Google’s LaMDA (Language Model for Dialogue Applications) model is so accurate that it purportedly convinced an AI engineer it was sentient.

When it is not scaring engineers, the model can generate conversational dialogue in a free-form way, in contrast to the task-based responses traditional models often come up with.

This is because LaMDA was trained on dialogue. According to Google, its dialogue-based approach allowed the model to pick up on the nuances that distinguished open-ended conversation from other forms of language.

Google first showcased the model at its I/O event in May 2021 and plans to use it across its products, including its search engine, Google Assistant and Workspace platform.

And at its 2022 I/O event, the company announced expansions to the model’s capabilities via LaMDA 2. The latest version is reportedly more finely tuned than the original and can now provide recommendations based on user queries. LaMDA 2 was trained on Google’s Pathways Language Model (PaLM), which has 540 billion parameters.
