NVIDIA Blackwell Ultra GB300 Delivers 50x Performance Boost for AI Agents





Terrill Dicki
Feb 16, 2026 17:24

NVIDIA’s GB300 NVL72 systems show 50x better throughput per megawatt and 35x lower token costs versus Hopper, with Microsoft, CoreWeave, and Oracle Cloud Infrastructure deploying at scale.





NVIDIA’s next-generation Blackwell Ultra platform is delivering dramatic cost and performance improvements for AI inference workloads, with new benchmark data showing the GB300 NVL72 achieves up to 50x higher throughput per megawatt and 35x lower cost per token compared to the previous Hopper generation.

The performance gains arrive as AI coding assistants and agentic applications have surged from 11% to roughly 50% of all AI queries over the past year, according to OpenRouter’s State of Inference report. These workloads demand both low latency for real-time responsiveness and long context windows when reasoning across entire codebases—exactly where Blackwell Ultra excels.

Major Cloud Providers Already Deploying

Microsoft, CoreWeave, and Oracle Cloud Infrastructure are rolling out GB300 NVL72 systems in production environments. The deployments follow successful GB200 NVL72 implementations that began shipping in late 2025, with inference providers like Baseten, DeepInfra, Fireworks AI, and Together AI already reporting 10x reductions in cost per token on the earlier Blackwell systems.

“As inference moves to the center of AI production, long-context performance and token efficiency become critical,” said Chen Goldberg, senior vice president of engineering at CoreWeave. “Grace Blackwell NVL72 addresses that challenge directly.”

Technical Improvements Driving Gains

The performance leap stems from NVIDIA’s codesign approach across hardware and software. Key improvements include higher-performance GPU kernels optimized for low latency, NVLink Symmetric Memory enabling direct GPU-to-GPU access, and programmatic dependent launch that minimizes idle time between operations.

Software optimizations from NVIDIA’s TensorRT-LLM and Dynamo teams have delivered up to 5x better performance on GB200 systems for low-latency workloads compared to just four months ago—gains that compound with the hardware improvements in GB300.

For long-context scenarios involving 128,000-token inputs with 8,000-token outputs, GB300 NVL72 delivers 1.5x lower cost per token than GB200 NVL72. The improvement comes from 1.5x higher NVFP4 compute performance and 2x faster attention processing in the Blackwell Ultra architecture.
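Because the 1.5x NVFP4 compute gain and the 2x attention gain apply to different phases of an inference pass, the overall speedup depends on how much of the runtime each phase consumes. A minimal Amdahl-style sketch illustrates this; the time fractions used here are hypothetical, not measured figures from NVIDIA:

```python
def combined_speedup(attn_frac: float,
                     attn_speedup: float = 2.0,
                     compute_speedup: float = 1.5) -> float:
    """Estimate overall speedup when a fraction `attn_frac` of runtime
    is attention (sped up by `attn_speedup`) and the remainder is
    compute-bound work (sped up by `compute_speedup`)."""
    return 1.0 / (attn_frac / attn_speedup + (1.0 - attn_frac) / compute_speedup)

# Hypothetical split: 40% of long-context runtime spent in attention.
print(f"{combined_speedup(0.4):.2f}x")  # lands between 1.5x and 2.0x
```

With any attention fraction between 0 and 1, the combined speedup falls between the two per-phase factors, which is consistent with the roughly 1.5x cost-per-token improvement the article cites for long-context workloads.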

What’s Next

NVIDIA is already previewing the Rubin platform as the successor to Blackwell, promising another 10x throughput improvement per megawatt for mixture-of-experts inference. The company claims Rubin can train large MoE models using one-fourth the GPUs required by Blackwell.

For organizations evaluating AI infrastructure investments, the GB300 NVL72 represents a significant inflection point. With rack-scale systems reportedly priced around $3 million and production ramping through early 2026, the economics of running agentic AI workloads at scale are shifting rapidly. The 35x cost reduction at low latencies could fundamentally change which AI applications become commercially viable.
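The cost-per-token economics above can be framed as a simple amortization calculation combining capital expense and power draw. The sketch below is a back-of-envelope model, not NVIDIA's methodology, and every input value in the usage line is a hypothetical placeholder:

```python
def cost_per_million_tokens(rack_price_usd: float,
                            amortization_years: float,
                            power_kw: float,
                            energy_usd_per_kwh: float,
                            tokens_per_second: float) -> float:
    """Blend amortized hardware cost and energy cost into a
    cost-per-million-tokens figure."""
    hours = amortization_years * 365 * 24
    capex_per_hour = rack_price_usd / hours          # hardware amortization
    energy_per_hour = power_kw * energy_usd_per_kwh  # electricity
    tokens_per_hour = tokens_per_second * 3600.0
    return (capex_per_hour + energy_per_hour) / tokens_per_hour * 1e6

# Hypothetical inputs: $3M rack, 4-year amortization, 120 kW draw,
# $0.08/kWh, 500k tokens/s aggregate throughput.
print(f"${cost_per_million_tokens(3e6, 4, 120, 0.08, 5e5):.4f} per 1M tokens")
```

A model like this makes the article's point concrete: raising throughput per megawatt cuts both denominator-side terms at once, so per-token cost falls faster than either hardware price or power efficiency alone would suggest.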

Image source: Shutterstock



