AI's New Frontier: Training Trillion-Parameter Models with Far Fewer GPUs

Researchers used just 8% of the world's most powerful supercomputer to train a model the size of ChatGPT

Ben Wodecki, Jr. Editor

January 8, 2024

2 Min Read

At a Glance

  • Scientists employ methods for reducing the training time of large-scale AI models on AMD GPUs.

Training a language model the size of OpenAI’s ChatGPT would normally require a sizable supercomputer. But scientists working on the world’s most powerful supercomputer discovered innovative techniques to train gigantic models using a lot less hardware.

In a new research paper, scientists from the famed Oak Ridge National Laboratory trained a one trillion parameter model using just a few thousand GPUs in their Frontier supercomputer, the most powerful non-distributed supercomputer in the world and one of only two exascale systems globally.

They used just 3,072 GPUs to train the giant large language model out of 37,888 AMD GPUs housed in Frontier. That means the researchers trained a model comparable to ChatGPT’s rumored size of a trillion parameters on just 8% of Frontier's computing power.

The Frontier team achieved this feat using distributed training strategies to spread the model's training across the system's parallel architecture. Using techniques such as sharded data parallelism to reduce communication between layers of nodes and tensor parallelism to handle memory constraints, the team was able to distribute the training of the model more efficiently.

Other techniques the researchers employed to coordinate the model's training include pipeline parallelism, which splits the model into stages trained across different nodes to improve throughput.
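As a rough illustration of how these strategies combine (the group sizes below are hypothetical, not the configuration reported in the ORNL paper), a so-called 3D-parallel layout factors the total GPU count into tensor-parallel, pipeline-parallel, and data-parallel dimensions:

```python
# Hypothetical 3D-parallel decomposition of the 3,072 GPUs used.
# The per-dimension sizes are illustrative assumptions, not figures
# from the researchers' paper.
total_gpus = 3072
tensor_parallel = 8      # GPUs jointly holding each layer's tensors
pipeline_parallel = 16   # sequential pipeline stages of the model
data_parallel = total_gpus // (tensor_parallel * pipeline_parallel)

# The three dimensions must multiply back to the total GPU count.
assert tensor_parallel * pipeline_parallel * data_parallel == total_gpus
print(data_parallel)  # 24 model replicas, each seeing different data
```

Each data-parallel replica here spans 128 GPUs (8 x 16), which is why the communication patterns within and between these groups dominate training efficiency at this scale.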


The results showed 100% weak scaling efficiency for both the 175-billion-parameter and 1-trillion-parameter models. The project also achieved strong scaling efficiencies of 89% and 87%, respectively, for the two models.
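Weak scaling holds the per-GPU workload fixed as GPUs are added, while strong scaling fixes the total problem size; in both cases, efficiency compares the achieved speedup against the ideal linear speedup. A minimal sketch of the metric (the function and throughput numbers are illustrative, not from the paper):

```python
def scaling_efficiency(baseline_throughput, scaled_throughput,
                       baseline_gpus, scaled_gpus):
    """Ratio of achieved speedup to ideal linear speedup.

    1.0 (i.e., 100%) means adding GPUs gave a fully proportional gain.
    """
    speedup = scaled_throughput / baseline_throughput
    ideal = scaled_gpus / baseline_gpus
    return speedup / ideal

# Ideal case: 8x the GPUs yields 8x the throughput -> 100% efficiency.
print(scaling_efficiency(100.0, 800.0, 64, 512))  # 1.0
```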

A trillion parameters

Training a large language model with a trillion parameters is a challenging undertaking. The authors said the model's state occupied a minimum of 14 terabytes of memory. By contrast, a single MI250X GPU in Frontier has only 64 gigabytes.

Methods like the ones the researchers explored will need further development to overcome such memory constraints.
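The back-of-envelope arithmetic from the figures above shows why a single GPU cannot come close to holding the model state:

```python
# Figures reported in the article: ~14 TB of model state for the
# 1-trillion-parameter model, and 64 GB of memory per MI250X GPU
# (as counted in the paper).
model_state_tb = 14
gpu_memory_gb = 64

model_state_gb = model_state_tb * 1024
min_gpus = -(-model_state_gb // gpu_memory_gb)  # ceiling division
print(min_gpus)  # 224: GPUs needed just to hold the model state
```

That floor of roughly 224 GPUs covers storage alone; the 3,072 GPUs actually used also have to hold activations and leave headroom for the parallel training strategies described above.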

However, one issue they faced was loss divergence caused by large batch sizes. Their paper states that future research into reducing training time for large-scale systems must improve large-batch training with smaller per-replica batch sizes.
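The trade-off behind that call is straightforward: under data parallelism the global batch is the per-replica batch times the replica count, so adding replicas inflates the batch unless each replica's share shrinks. A hedged sketch with illustrative numbers (not the paper's):

```python
# Illustrative numbers, not from the paper: the global batch grows
# with the replica count unless the per-replica batch size shrinks.
per_replica_batch = 16
replicas = 24
global_batch = per_replica_batch * replicas
print(global_batch)  # 384 samples consumed per optimizer step

# Halving the per-replica batch keeps the global batch constant
# when the replica count doubles.
assert (per_replica_batch // 2) * (replicas * 2) == global_batch
```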

The researchers also called for more work to be done around AMD GPUs. They wrote that most large-scale model training is done on platforms that support Nvidia solutions. While the researchers created what they called a blueprint for efficient training of LLMs on non-Nvidia platforms, they wrote: “There needs to be more work exploring efficient training performance on AMD GPUs.”


Frontier held onto its crown as the most powerful supercomputer in the most recent Top500 list, pipping the Intel-powered Aurora supercomputer.


About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.

