The machine will be capable of training AI models on real-world data sourced from the company’s platforms
Meta (previously known as Facebook) is attempting to build the world’s fastest AI supercomputer.
The AI Research SuperCluster (RSC) will feature 16,000 Nvidia A100 GPUs and is set for completion in the middle of 2022.
The company has already started using the hardware to train large computer vision and natural language processing (NLP) models.
“The experiences we’re building for the metaverse require enormous compute power (quintillions of operations/second) and RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages, and more,” said Mark Zuckerberg, chairman, CEO and controlling shareholder of the company.
Unlike Meta’s previous AI supercomputer launched in 2017, the RSC will be capable of training machine learning models on real-world data sourced from the company’s social media platforms.
Meta has been involved in AI research for more than a decade. It established the Facebook AI Research lab (FAIR) in 2013, which went on to develop tools for chatbot design, methods for making AI systems forget unnecessary information, and ‘synthetic skin’ that gives robots the sense of touch.
The lab’s most important contribution to the field is undoubtedly PyTorch, an open source deep learning framework that emerged as something of a standard and is now widely used by developers and data scientists across a variety of platforms.
Meta launched its first dedicated AI supercomputer in 2017, built with 22,000 Nvidia V100 GPUs.
Its successor already considerably outclasses that machine: Meta claims RSC delivers three times the performance in large-scale NLP workflows while running on less than half of its planned final hardware.
The first phase of the project consists of 760 Nvidia DGX A100 server systems with a total of 6,080 GPUs, connected using Nvidia’s Quantum 200 Gb/s InfiniBand fabric.
The storage tier is equipped with 185 PB of all-flash storage from Pure Storage, and 46 PB of cache storage spread across Penguin Computing Altus servers. Training data is delivered through FAIR’s own purpose-built storage service called the AI Research Store (AIRStore).
Once the RSC is complete, the same InfiniBand fabric will connect 16,000 GPUs, making this the largest DGX A100 deployment to date. It will be served by a caching and storage system with 16 TB/s of bandwidth and is expected to deliver nearly 5 exaflops of mixed precision compute.
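As a rough sanity check on those figures, the short Python sketch below multiplies them out. It assumes eight GPUs per DGX A100 system (consistent with 760 systems holding 6,080 GPUs) and Nvidia’s published peak of 312 teraflops per A100 for dense FP16/BF16 Tensor Core math. The per-GPU figure does not come from Meta’s announcement, and the function names are purely illustrative.

```python
# Back-of-the-envelope check of the RSC hardware figures quoted above.
# Assumption: 8 GPUs per DGX A100 system.
# Assumption: 312 TFLOPS peak per A100 (Nvidia's dense FP16/BF16 Tensor Core spec).
GPUS_PER_DGX_A100 = 8
A100_PEAK_TFLOPS = 312.0
TFLOPS_PER_EXAFLOP = 1_000_000  # 1 exaflop = 10^18 FLOPS = 10^6 teraflops


def gpu_count(dgx_systems: int) -> int:
    """Total GPUs in a given number of DGX A100 systems."""
    return dgx_systems * GPUS_PER_DGX_A100


def peak_exaflops(gpus: int, tflops_per_gpu: float = A100_PEAK_TFLOPS) -> float:
    """Theoretical peak throughput in exaflops; real workloads achieve less."""
    return gpus * tflops_per_gpu / TFLOPS_PER_EXAFLOP


if __name__ == "__main__":
    phase_one_gpus = gpu_count(760)
    print(f"Phase one GPUs: {phase_one_gpus}")
    print(f"Phase one peak: {peak_exaflops(phase_one_gpus):.2f} exaflops")
    print(f"Full build peak: {peak_exaflops(16_000):.2f} exaflops")
```

Under these assumptions the full 16,000-GPU build works out to just under 5 exaflops of theoretical peak, in line with the “nearly 5 exaflops” figure. Sustained performance on real training jobs would be lower.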
“We wanted this infrastructure to be able to train models with more than a trillion parameters on data sets as large as an exabyte — which, to provide a sense of scale, is the equivalent of 36,000 years of high-quality video,” Meta’s technical program manager Kevin Lee and software engineer Shubho Sengupta said in a post on the company’s blog.
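The video comparison can be unpacked with a little arithmetic. The sketch below works out the bitrate implied by fitting 36,000 years of video into one exabyte. It assumes a decimal exabyte (10^18 bytes) and a 365.25-day year; the blog post specifies neither, and the function name is hypothetical.

```python
# Implied video bitrate if one exabyte holds 36,000 years of footage.
# Assumption: decimal exabyte (10^18 bytes), not a binary exbibyte.
# Assumption: a year of 365.25 days.
EXABYTE_BYTES = 10**18
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def implied_bitrate_mbps(total_bytes: int, years_of_video: float) -> float:
    """Average bitrate in megabits per second for the given volume and duration."""
    seconds = years_of_video * SECONDS_PER_YEAR
    bits_per_second = total_bytes * 8 / seconds
    return bits_per_second / 1_000_000


if __name__ == "__main__":
    mbps = implied_bitrate_mbps(EXABYTE_BYTES, 36_000)
    print(f"Implied bitrate: {mbps:.1f} Mbit/s")
```

Under those assumptions the comparison implies an average of roughly 7 Mbit/s, a plausible rate for compressed high-definition video.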
Unlike its previous supercomputer, which leveraged only open source and publicly available data sets, Meta’s new machine will be using real-world training data obtained directly from the users of the company’s platforms.
For this reason, Meta says the RSC has been designed from the ground up with privacy and security in mind: the supercomputer is isolated from the Internet, with no direct inbound or outbound connections, and traffic can flow only from Meta’s production data centers. User data is anonymized, and the entire data path from the storage systems to the GPUs is encrypted.
“We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together,” Lee and Sengupta said.
“Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform — the metaverse, where AI-driven applications and products will play an important role.”
The authors said that the RSC will also be used to better identify “harmful content.” Meta’s recent advances in this area include the introduction of few-shot learning (FSL), which makes it easier to detect posts that breach its policies in new and unexpected ways.
Supply chain blues
The ongoing chip supply shortage has affected countless infrastructure projects, and the RSC was no exception.
“RSC began as a completely remote project that the team took from a simple shared document to a functioning cluster in about a year and a half,” Lee and Sengupta said.
“COVID-19 and industry-wide wafer supply constraints also brought supply chain issues that made it difficult to get everything from chips to components like optics and GPUs, and even construction materials — all of which had to be transported in accordance with new safety protocols.
“To build this cluster efficiently, we had to design it from scratch, creating many entirely new Meta-specific conventions and rethinking previous ones along the way. We had to write new rules around our data center designs — including their cooling, power, rack layout, cabling, and networking (including a completely new control plane), among other important considerations.”
This article was first published on Data Center Knowledge.