Meta’s post on new GPU clusters reveals Llama 3 training is ‘ongoing’
Meta has shared details on its AI infrastructure and unveiled new GPU clusters it is using to support next-generation model training, including Llama 3.
In a blog post, Meta detailed two new data center-scale clusters, designed to support larger and more complex models than its previous hardware could.
The clusters each contain 24,576 Nvidia H100 GPUs. By comparison, Meta’s original clusters contained around 16,000 Nvidia A100 GPUs. Omdia research published in 2023 placed Meta as one of Nvidia’s largest customers, snapping up thousands of its flagship H100s.
Meta will use the hardware to train current and future AI systems, and its blog post again references Llama 3, the successor to its Llama 2 model. At the time of writing, Meta has not published any concrete details about Llama 3, but the blog post notes that its training is “ongoing.”
Meta said it will also use the infrastructure for AI research and development.
Meta’s long-term goal is to build artificial general intelligence (AGI), or “advanced machine intelligence,” as its chief scientist Yann LeCun prefers to call it. The blog post states that Meta is scaling its clusters to power those ambitions.
Meta plans to continue building out its AI infrastructure. By the end of 2024, the company says it will have 350,000 Nvidia H100 GPUs, which, combined with the other accelerators in its overall portfolio, will give it compute power equivalent to nearly 600,000 H100s.
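That “nearly 600,000 H100s” figure is an equivalence estimate across a mixed fleet rather than a literal GPU count. The sketch below shows the shape of such a calculation; the 350,000 H100 count comes from Meta’s post, but every other count and per-GPU equivalence ratio is an illustrative assumption, not a figure Meta has published.

```python
# Illustrative back-of-envelope: expressing a mixed GPU fleet in
# "H100-equivalent" compute. Only the 350,000 H100 figure comes from
# Meta's post; all other counts and ratios are hypothetical assumptions
# chosen to show the shape of the calculation.

fleet = {
    # model: (gpu_count, rough per-GPU throughput relative to one H100)
    "H100": (350_000, 1.0),   # stated in Meta's post
    "A100": (250_000, 0.4),   # assumed count and equivalence ratio
    "other": (300_000, 0.5),  # assumed count and equivalence ratio
}

h100_equivalents = sum(count * ratio for count, ratio in fleet.values())
print(f"Portfolio compute ≈ {h100_equivalents:,.0f} H100-equivalents")
# 350,000 + 100,000 + 150,000 = 600,000 with these assumed inputs,
# showing how a 350,000-H100 fleet can be quoted as ~600,000 equivalents.
```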
The two new clusters were built with different network fabrics: one uses RDMA over Converged Ethernet (RoCE) based on Arista 7800 switches, while the other uses Nvidia’s Quantum2 InfiniBand fabric. Both offer 400 Gbps endpoints.
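For a rough sense of scale, the figures in the post imply each cluster’s aggregate injection bandwidth directly. The arithmetic below is just that conversion; it says nothing about realized bisection bandwidth, which depends on the fabric topology, routing, and any oversubscription.

```python
# Scale check using only figures from the post: 24,576 GPUs per cluster,
# each with a 400 Gbps network endpoint. This is aggregate injection
# bandwidth only; delivered bisection bandwidth depends on the topology.

gpus_per_cluster = 24_576
endpoint_gbps = 400

aggregate_gbps = gpus_per_cluster * endpoint_gbps
print(f"≈ {aggregate_gbps / 1_000:,.0f} Tbps of injection bandwidth per cluster")
# 24,576 * 400 Gbps = 9,830,400 Gbps ≈ 9,830 Tbps (~9.8 Pbps)
```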
“With these two, we are able to assess the suitability and scalability of these different types of interconnect for large-scale training, giving us more insights that will help inform how we design and build even larger, scaled-up clusters in the future,” Meta’s blog post reads.
Both clusters are built using Grand Teton, Meta’s in-house-designed, open GPU hardware platform, which lets the company build clusters purpose-built for specific applications. The clusters also use Meta’s Open Rack architecture.