Meta’s post on new GPU clusters reveals Llama 3 training is ‘ongoing’

Ben Wodecki, Jr. Editor

March 12, 2024

2 Min Read
Image caption: Meta wants 350,000 Nvidia H100 GPUs by the end of 2024 (Image: AI Business via ChatGPT)

At a Glance

  • Meta unveils two new data center-scale GPU clusters that are powering its AI model training.
  • The new clusters are training, among other projects, Llama 3, the next iteration of its popular line of open-source models.

Meta has shared details on its AI infrastructure and unveiled new GPU clusters it is using to support next-generation model training, including Llama 3.

In a blog post, Meta provided information on two new data center-scale clusters, designed to support larger and more complex models than its previous hardware could handle.

The clusters each contain 24,576 Nvidia H100 GPUs. Meta's original clusters, by comparison, contained around 16,000 Nvidia A100 GPUs. Omdia research published in 2023 placed Meta among Nvidia's largest clients, snapping up thousands of its flagship H100s.

Meta will use the hardware to train current and future AI systems, with the company again referencing Llama 3, the successor to its Llama 2 model, in its blog post. The company had not published any concrete information on Llama 3 at the time of writing. However, the blog post mentions that Llama 3 training is “ongoing.”

Meta said it will also use the infrastructure for AI research and development.

Meta’s long-term goal is to build AGI, or “advanced machine intelligence,” as its chief scientist Yann LeCun prefers to call it. Meta’s blog post states that it is scaling its clusters to power those AGI ambitions.

Meta plans to continue building out its AI infrastructure. By the end of 2024, the company expects to have 350,000 Nvidia H100 GPUs; combined with the rest of its hardware portfolio and techniques like clustering, that will amount to compute power equivalent to nearly 600,000 H100s.
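The “H100 equivalents” framing converts a mixed GPU fleet into a single number by weighting each GPU type by its throughput relative to an H100. The sketch below illustrates that arithmetic; the non-H100 counts and relative-throughput figures are purely hypothetical assumptions for illustration, not Meta’s actual numbers.

```python
# Toy illustration of "H100 equivalents" for a mixed GPU fleet.
# Counts and relative throughputs for non-H100 GPUs are hypothetical.
fleet = {
    # gpu_model: (count, assumed throughput relative to one H100)
    "H100": (350_000, 1.0),   # figure stated by Meta for end of 2024
    "A100": (400_000, 0.6),   # hypothetical older-GPU share of the portfolio
}

equivalent_h100s = sum(count * rel for count, rel in fleet.values())
print(f"{equivalent_h100s:,.0f} H100 equivalents")  # 590,000 H100 equivalents
```

With these illustrative weights, 350,000 H100s plus 400,000 older GPUs at 60% relative throughput lands near the “nearly 600,000 H100s” figure in Meta’s post.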


Technical talk

The two new clusters were built with different network fabric solutions: one with RDMA over Converged Ethernet (RoCE) based on Arista 7800 switches, and the other with Nvidia Quantum-2 InfiniBand fabric. Both offer 400 Gbps endpoints.

“With these two, we are able to assess the suitability and scalability of these different types of interconnect for large-scale training, giving us more insights that will help inform how we design and build even larger, scaled-up clusters in the future,” Meta’s blog post reads.

Both clusters are built using Grand Teton, Meta’s in-house-designed, open GPU hardware platform. Grand Teton allows Meta to build new clusters that are purpose-built for specific applications. The clusters also make use of Meta’s Open Rack architecture.

About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.

