Nvidia said today that its Blackwell chips are setting the standard for AI as the company distributes them to data centers and what it calls AI factories across the globe.
Next-generation AI applications that leverage the most recent developments in training and inference are being trained and deployed more quickly thanks to Nvidia and its partners.
These new applications have higher performance needs, which the Nvidia Blackwell architecture is designed to handle. In the most recent round of MLPerf Training, the 12th since the benchmark’s inception in 2018, the Nvidia AI platform delivered the highest performance at scale on every benchmark and powered every result submitted on the round’s most difficult large language model (LLM) test, Llama 3.1 405B pretraining.
The Nvidia platform was the only one to submit results on all MLPerf Training v5.0 benchmarks, demonstrating its performance and adaptability across a broad range of AI workloads: LLMs, recommendation systems, multimodal LLMs, object detection, and graph neural networks.
Two AI supercomputers powered by the Nvidia Blackwell platform were used in the at-scale submissions: Nyx, based on Nvidia DGX B200 systems, and Tyche, built with Nvidia GB200 NVL72 rack-scale systems. Additionally, Nvidia partnered with CoreWeave and IBM to submit GB200 NVL72 results using 2,496 Blackwell GPUs and 1,248 Nvidia Grace CPUs.
Blackwell outperformed previous-generation architectures at the same scale by 2.2 times on the new Llama 3.1 405B pretraining test.
Nvidia DGX B200 systems with eight Blackwell GPUs delivered 2.5 times the performance on the Llama 2 70B LoRA fine-tuning benchmark compared to a submission using the same number of GPUs in the previous round.
These performance gains highlight innovations in the Blackwell architecture, including high-density liquid-cooled racks, 13.4TB of coherent memory per rack, fifth-generation Nvidia NVLink and Nvidia NVLink Switch interconnect technologies for scale-up, and Nvidia Quantum-2 InfiniBand networking for scale-out. Advancements in the Nvidia NeMo Framework software stack also raise the bar for next-generation multimodal LLM training, which is essential for commercializing agentic AI systems.
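To make the scale-up versus scale-out distinction concrete, here is a minimal data-parallel training sketch using PyTorch's DistributedDataParallel. This is illustrative only, not Nvidia's MLPerf submission code: the model, batch, and sizes are placeholders, and the NCCL backend simply rides on whatever fabric is present, NVLink within a node and InfiniBand or Ethernet across nodes.

```python
# Minimal data-parallel training sketch with PyTorch DDP. Illustrative only;
# the model and data are stand-ins, not MLPerf workloads. Launch with, e.g.:
#   torchrun --nproc_per_node=8 ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE. NCCL uses NVLink
    # within a node (scale-up) and InfiniBand/Ethernet across nodes
    # (scale-out).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = DDP(torch.nn.Linear(4096, 4096).to(device),
                device_ids=[local_rank])          # stand-in model
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 4096, device=device)  # synthetic batch per GPU
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()  # DDP all-reduces gradients over the interconnect
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The same script scales from one rack to many; only the launcher arguments and the fabric underneath change, which is why interconnect bandwidth shows up so directly in at-scale training results.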
One day, these AI-powered apps will run in AI factories, the backbone of the agentic AI economy, generating tokens and useful intelligence that almost every industry and academic field will be able to apply.
GPUs, CPUs, networking, high-speed fabrics, and a wide range of software, including Nvidia CUDA-X libraries, the NeMo Framework, Nvidia TensorRT-LLM, and Nvidia Dynamo, are all part of the Nvidia data center platform. By enabling enterprises to train and deploy models more rapidly, this finely tuned combination of hardware and software significantly accelerates time to value.
The Nvidia partner ecosystem was heavily involved in this MLPerf round. In addition to CoreWeave and IBM, ASUS, Cisco, Giga Computing, Lambda, Lenovo, Quanta Cloud Technology, and Supermicro all made strong submissions.
This round featured the first MLPerf Training submissions using GB200. The MLCommons Association, which has more than 125 members and affiliates, develops the MLPerf benchmarks. Its time-to-train metric ensures the training process produces a model that meets the required accuracy, and its standardized benchmark run rules ensure apples-to-apples performance comparisons. Results are peer-reviewed before publication.
The Fundamentals Of Training Benchmarks
I knew Dave Salvator when he worked for the tech press. He now serves as director of accelerated computing products in Nvidia’s Accelerated Computing Group. During a press briefing, Salvator noted that Nvidia CEO Jensen Huang talks about the notion of multiple scaling laws for AI. Among these is pre-training, which essentially teaches the AI model from scratch. According to Salvator, that is where AI starts, and it is a significant computing effort.
From there, Nvidia moves to post-training scaling. This is where models go to school, so to speak: you can do things like fine-tuning, where you bring in a different data set to teach a pre-trained model more domain knowledge about your specific data.
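As a rough illustration of why LoRA fine-tuning (the technique in the Llama 2 70B benchmark above) is so much cheaper than pre-training, here is a minimal sketch in plain PyTorch. The layer sizes and rank are hypothetical, and a real run would wrap every attention projection of a large model rather than one standalone layer.

```python
# Conceptual sketch of LoRA (low-rank adaptation). The pre-trained weights
# stay frozen; only two small low-rank matrices are trained. Sizes and rank
# are made up for illustration.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # frozen pre-trained weights
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base output plus a trainable low-rank update: x @ A @ B.
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable:,} of {total:,}")  # a tiny fraction
```

Because only the low-rank adapters receive gradients, the optimizer state and gradient traffic shrink dramatically, which is what makes fine-tuning a much lighter workload than pre-training.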
Finally, there is test-time reasoning, also known as extended thinking or test-time scaling. Agentic AI is another name for this: AI that can think, reason, and solve problems. Rather than asking a question and getting a fairly simple response, test-time reasoning and scaling can produce rich analysis and take on far more complex problems.
Additionally, there is generative AI, which can produce content on demand, including images, audio, and text such as summaries and translations. AI involves several different kinds of scaling; for the benchmarks, Nvidia concentrated on pre-training and post-training results.
Training marks what Nvidia refers to as the investment phase of AI. “You start to see your return on your investment in AI when you start inferencing, deploying those models, and then generating those tokens,” Salvator said.
The MLPerf benchmark was created in 2018, and the training suite is now in its 12th round. It covers both training and inference tests, and the consortium behind it has more than 125 members. The industry regards the benchmarks as rigorous.
“As many of you are probably aware, performance claims in the AI space can sometimes be a little Wild West. MLPerf aims to bring some order to that chaos,” Salvator said. “The amount of work required of everyone is the same. When it comes to convergence, everyone is held to the same standard. Once results are submitted, all other submitters review and validate them, and anyone can raise questions or even challenge a result.”
The most logical training metric is how long it takes to train an AI model to what is known as convergence, meaning it hits a predetermined accuracy threshold. Salvator said this makes for an apples-to-apples comparison that accounts for constantly changing workloads.
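In pseudocode terms, the time-to-train idea works roughly like the toy loop below: the clock runs until the model reaches a predefined quality target, so faster hardware or software is rewarded only if the model still converges. The target, model, and task here are invented for illustration; real MLPerf runs check a benchmark-specific accuracy threshold on held-out data.

```python
# Toy illustration of a time-to-train-to-convergence measurement. The
# quality target and regression task are hypothetical stand-ins for a
# benchmark's accuracy threshold.
import time
import torch
import torch.nn as nn

TARGET_LOSS = 0.05                      # hypothetical quality target
model = nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(512, 16)
y = x.sum(dim=1, keepdim=True) * 0.1    # synthetic regression task

start = time.perf_counter()
for epoch in range(10_000):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if loss.item() <= TARGET_LOSS:      # "convergence" reached: stop clock
        break
print(f"time to train: {time.perf_counter() - start:.2f}s "
      f"({epoch + 1} epochs, final loss {loss.item():.4f})")
```

Measuring wall-clock time to a fixed quality bar, rather than raw throughput, is what prevents submitters from trading away accuracy for speed.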
The new Llama 3.1 405B workload replaces the GPT-3 175B task previously included in the benchmark. Salvator pointed out that Nvidia set several records in this round. The Nvidia GB200 NVL72 AI factories are fresh from the fabrication plants. Nvidia’s image generation results improved by 2.5 times from the previous-generation Hopper chips to Blackwell.
“As we continue to improve our software optimizations and as new and, frankly, heavier workloads enter the market, we fully expect to get more performance from the Blackwell architecture over time, as we’re still fairly early in the Blackwell product life cycle,” Salvator said.
He pointed out that Nvidia was the only business to have entered every benchmark.
“A number of factors work together to give us the outstanding results we’re getting. In addition to other general architectural excellence in Blackwell, our fifth-generation NVLink and NVSwitch are offering up to 2.66 times higher performance, and our continuous software improvements are what enable that performance,” Salvator said.
“We have been known as those GPU guys for a very long time because of Nvidia’s heritage,” he continued. “While we still build excellent GPUs, we have evolved from a chip company to a system company with products like our DGX servers, and now to building entire racks and data centers with our rack designs, which are now reference designs that help our partners get to market more quickly. These data centers in turn build out entire infrastructure, which we now refer to as AI factories. It has been a fascinating journey.”