The 5-petaflop Nvidia DGX A100 hopes to run your AI workloads

by Sebastian Moss

While the HGX will do the same for the cloud

What do you get if you take eight of Nvidia’s new A100 GPUs, two 64-core AMD Rome CPUs, six NVSwitches, 15TB of Gen 4 NVMe SSD storage, and nine Mellanox 200Gbps network interfaces, and package them all together?

Well, a bill for $199,000 - but also a lot of AI performance. Nvidia’s latest DGX reference architecture is the company’s preferred approach to shipping its highest-performance chips.

The DGX A100, as the most recent iteration is named, is capable of five petaflops of FP16 performance, 2.5 petaflops at TF32, and 156 teraflops at FP64. It also delivers 10 petaops (integer operations, not flops) at INT8.
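Those headline figures line up with Nvidia’s published per-GPU numbers for the A100, which count Tensor Core throughput with structured sparsity enabled, multiplied across the system’s eight GPUs. As a rough sanity check (the per-GPU rates below are assumptions drawn from Nvidia’s A100 spec sheet, not from this article):

# Back-of-the-envelope check of the DGX A100's headline throughput.
# Per-GPU figures are assumed from Nvidia's public A100 spec sheet
# (Tensor Core rates, with structured sparsity where it applies);
# only the 8-GPU system totals appear in the article itself.
GPUS = 8
per_gpu = {
    "FP16 (teraflops)": 624,    # with sparsity
    "TF32 (teraflops)": 312,    # with sparsity
    "FP64 (teraflops)": 19.5,   # Tensor Core FP64
    "INT8 (teraops)":   1248,   # with sparsity
}
for precision, rate in per_gpu.items():
    print(f"{precision}: {GPUS * rate:g} per system")
# FP16: 4992 (~5 petaflops), TF32: 2496 (~2.5 petaflops),
# FP64: 156 teraflops, INT8: 9984 (~10 petaops)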

AI ready

“Nvidia DGX A100 is the ultimate instrument for advancing AI,” Jensen Huang, the company’s ebullient CEO, said as he unveiled the product during Nvidia’s now-virtual GTC.

“Nvidia DGX is the first AI system built for the end-to-end machine learning workflow - from data analytics to training to inference. And with the giant performance leap of the new DGX, machine learning engineers can stay ahead of the exponentially growing size of AI models and data.”

Among the first customers of the DGX, which has 320GB of GPU memory for training on large AI datasets, is Argonne National Laboratory. Rick Stevens, associate laboratory director at the Department of Energy facility, said that the system would be used “in the fight against COVID-19.”

He added: “The compute power of the new DGX A100 systems coming to Argonne will help researchers explore treatments and vaccines and study the spread of the virus, enabling scientists to do years’ worth of AI-accelerated work in months or days.”

Nvidia has also released a version of the DGX on steroids: the DGX SuperPOD reference architecture. It's 140 DGX A100 systems all clustered together, capable of 700 petaflops of 'AI computing power.'

So far, the SuperPOD has just one customer: Nvidia. The company plans to install four of the pods as part of its internal Saturn V supercomputer, adding 2.8 exaflops of AI computing power, for a total of 4.6 exaflops. 
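The arithmetic checks out: 140 systems at five petaflops apiece give each pod its 700 petaflops, and four pods add 2.8 exaflops. A minimal sketch of that math (the ~1.8 exaflops of pre-existing Saturn V capacity is inferred from the article’s totals, not stated in it):

# SuperPOD math, using only the figures quoted in the article.
DGX_PER_POD = 140
PFLOPS_PER_DGX = 5                          # FP16 "AI" petaflops per DGX A100
PODS = 4

pod_pflops = DGX_PER_POD * PFLOPS_PER_DGX   # 700 petaflops per SuperPOD
added_eflops = PODS * pod_pflops / 1000     # 2.8 exaflops added to Saturn V
existing_eflops = 4.6 - added_eflops        # implies ~1.8 exaflops before the upgrade
print(pod_pflops, added_eflops, round(existing_eflops, 1))   # 700 2.8 1.8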

For cloud computing companies like Amazon Web Services, Google, and Microsoft Azure, there’s a slightly smaller option: the HGX A100. It will feature four A100s instead of the DGX’s eight.

Moving further down the power scale is the EGX A100, with just one GPU and a Mellanox ConnectX-6 SmartNIC, targeting the edge market.
