Nvidia's new open-weight Nemotron 3 Super combines three different architectures to beat GPT-OSS and Qwen on throughput

Multi-agent systems, designed to handle long-horizon tasks like software engineering or cybersecurity triage, can generate up to 15 times the token volume of standard chats, threatening their cost-effectiveness for enterprise tasks.

But today, Nvidia sought to help solve this problem with the release of Nemotron 3 Super, a 120-billion-parameter hybrid model with weights posted on Hugging Face.

By merging disparate architectural philosophies (state-space models, transformers, and a novel “latent” mixture-of-experts design), Nvidia is attempting to provide the specialized depth needed for agentic workflows without the bloat typical of dense reasoning models, all while making the model available for commercial use under mostly open weights.

Triple Hybrid Architecture

At the core of Nemotron 3 Super is a sophisticated architectural triad that balances memory efficiency with precise reasoning. The model uses a hybrid Mamba-Transformer backbone, which interleaves Mamba-2 layers with strategically placed Transformer attention layers.

To understand the implications for enterprise production, consider the “needle in a haystack” problem. Mamba-2 layers act like a “fast-travel” highway system, handling the vast majority of sequence processing with linear-time complexity. This allows the model to maintain a huge 1-million-token context window without exploding the memory footprint of the KV cache. However, pure state-space models often struggle with associative recall.

To fix this, Nvidia has strategically inserted Transformer attention layers as “global anchors”, ensuring that the model can accurately retrieve specific facts hidden deep within a codebase or a stack of financial reports.
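The memory argument above can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only: the layer count, head sizes, and the one-in-eight attention ratio are hypothetical values, not Nemotron 3 Super's published configuration. It shows how the KV cache shrinks when only a fraction of layers use attention.

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # Two tensors (K and V) per attention layer, one vector per token per KV head.
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, chosen only for illustration.
TOTAL_LAYERS = 48
KV_HEADS = 8
HEAD_DIM = 128
CONTEXT = 1_000_000  # the 1-million-token window mentioned above

# Every layer is attention: the cache grows with all 48 layers.
full_attention = kv_cache_bytes(TOTAL_LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)

# Assumption: 1 in 8 layers is attention. The remaining state-space layers
# keep a fixed-size recurrent state that does not grow with sequence length,
# so they are left out of the cache estimate.
hybrid = kv_cache_bytes(TOTAL_LAYERS // 8, KV_HEADS, HEAD_DIM, CONTEXT)

print(f"all-attention KV cache: {full_attention / 1e9:.1f} GB")
print(f"hybrid KV cache:        {hybrid / 1e9:.1f} GB")
print(f"reduction factor:       {full_attention // hybrid}x")
```

Under these assumed numbers the cache shrinks in direct proportion to the share of attention layers, which is why a hybrid design can afford a much longer context on the same hardware.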

Beyond the backbone, the model introduces Latent Mixture-of-Experts (LatentMoE). Traditional mixture-of-experts (MoE) designs route tokens to experts in their full hidden dimension, which creates a computational bottleneck as the model scales. LatentMoE solves this by projecting tokens into a compressed space before sending them to the experts.

This “expert compression” allows the model to consult four times as many experts for the same computational cost. That granularity matters for agents that must switch between Python syntax, SQL logic, and conversational reasoning within a single turn.
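A rough cost model shows where a figure like "four times more experts" can come from. This is a sketch under stated assumptions, not Nvidia's implementation: the widths, the 4:1 compression ratio, and the two-matrix expert shape are all hypothetical, and the compression ratio was picked to match the claim above.

```python
def expert_macs(in_dim, ffn_dim):
    """Approximate multiply-accumulates for one expert FFN on one token."""
    # Up-projection (in_dim x ffn_dim) plus down-projection (ffn_dim x in_dim).
    return 2 * in_dim * ffn_dim


# Hypothetical sizes, for illustration only.
HIDDEN = 4096          # full hidden width of the model
LATENT = HIDDEN // 4   # assumed compressed width seen by the experts
FFN = 2048             # inner width of each expert
ACTIVE_EXPERTS = 8     # experts consulted per token in the standard design

# Standard MoE: every active expert works at the full hidden width.
standard_cost = ACTIVE_EXPERTS * expert_macs(HIDDEN, FFN)

# Latent variant: each expert works at the compressed width instead.
latent_cost_per_expert = expert_macs(LATENT, FFN)
experts_at_same_cost = standard_cost // latent_cost_per_expert

# The projections into and out of the latent space are paid once per token.
projection_overhead = 2 * HIDDEN * LATENT

print(f"experts per token at equal cost: {experts_at_same_cost}")
print(f"projection overhead vs expert cost: {projection_overhead / standard_cost:.2%}")
```

In this simplified model the per-expert cost scales linearly with the input width, so a 4:1 compression buys four times as many active experts, with the shared projections adding a small fixed overhead.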

Multi-token prediction (MTP) further accelerates the model. While standard models predict a single next token, MTP predicts several future tokens simultaneously. This serves as a “built-in draft model,” enabling native speculative decoding that can deliver up to 3x wall-clock speedups for structured generation tasks such as code or tool calls.
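The idea behind speculative decoding can be sketched with two toy "models". This is a minimal illustration, not the model's actual MTP mechanism: `target_next` and `draft_next` are hypothetical stand-ins, the draft is a separate function rather than built-in prediction heads, and the verification loop simulates what a real model would score in one parallel pass.

```python
def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Greedy speculative decoding. Returns (tokens, number of target passes)."""
    tokens = list(prompt)
    target_passes = 0
    goal = len(prompt) + max_new
    while len(tokens) < goal:
        # 1. Cheaply draft k candidate tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. One target pass checks every drafted position (simulated here
        #    position by position; a real model scores them in parallel).
        target_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's correction, stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched; the same pass yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


def target_next(seq):
    """Toy 'large model': counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq):
    """Toy draft: agrees with the target except right after a 7."""
    return 0 if seq[-1] == 7 else (seq[-1] + 1) % 10


output, passes = speculative_decode(target_next, draft_next, [0], max_new=12)
print(output, passes)
```

The output is identical to what the target would produce on its own, but it needs far fewer target passes than the 12 that one-token-at-a-time decoding would take. The gain depends on how often the draft agrees with the target, which is why predictable output such as code or tool calls benefits most.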

The Blackwell Advantage

For enterprises, the most significant technical leap in Nemotron 3 Super is its optimization for the Nvidia Blackwell GPU platform. By pre-training natively in NVFP4 (4-bit floating point), Nvidia has achieved a breakthrough in production efficiency.

On Blackwell, the model delivers inference up to 4x faster than 8-bit models running on the previous Hopper architecture, with no loss in accuracy.
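Part of the appeal of a 4-bit format is simple arithmetic on weight storage. The sketch below is a rough estimate only: it uses the 120-billion-parameter figure from this article, treats every parameter as stored at the stated width, and ignores the scale-factor overhead that block-scaled formats carry, as well as activations and the KV cache.

```python
def weight_gigabytes(num_params, bits_per_param):
    """Approximate storage for model weights, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 120_000_000_000  # parameter count cited in the article

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>6}: {weight_gigabytes(PARAMS, bits):.0f} GB")
```

Halving the bits per parameter halves the bytes that must be read from memory for every generated token, which is one reason lower-precision formats can raise inference throughput.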

In practical performance, Nemotron 3 Super is a specialized tool for agentic reasoning.

It currently ranks No. 1 on DeepResearch Bench, a benchmark measuring an AI’s ability to conduct thorough, multi-step research across large document sets.

Benchmark	Nemotron 3 Super	Qwen3.5-122B-A10B	GPT-OSS-120B
General knowledge
MMLU-Pro	83.73	86.70	81.00
Reasoning
AIME25 (no tools)	90.21	90.36	92.50
HMMT Feb25 (no tools)	93.67	91.40	90.00
HMMT Feb25 (with tools)	94.73	89.55	—
GPQA (no tools)	79.23	86.60	80.10
GPQA (with tools)	82.70	—	80.09
LiveCodeBench (v5 2024-07↔2024-12)	81.19	78.93	88.00
SciCode (subtask)	42.05	42.00	39.00
HLE (no tools)	18.26	25.30	14.90
HLE (with tools)	22.82	—	19.0
Agentic
Terminal Bench (hard subset)	25.78	26.80	24.00
Terminal Bench Core 2.0	31.00	37.50	18.70
SWE-Bench (OpenHands)	60.47	66.40	41.9
SWE-Bench (OpenCode)	59.20	67.40	—
SWE-Bench (Codex)	53.73	61.20	—
SWE-Bench Multilingual (OpenHands)	45.78	—	30.80
TauBench V2
Airline	56.25	66.0	49.2
Retail	62.83	62.6	67.80
Telecom	64.36	95.00	66.00
Average	61.15	74.53	61.0
BrowseComp with Search	31.28	—	33.89
BIRD Bench	41.80	—	38.25
Chat and instruction following
IFBench (prompt)	72.56	73.77	68.32
Scale AI Multi-Challenge	55.23	61.50	58.29
Arena-Hard-V2	73.88	75.15	90.26
Long context
AA-LCR	58.31	66.90	51.00
RULER @ 256k	96.30	96.74	52.30
RULER @ 512k	95.67	95.95	46.70
RULER @ 1M	91.75	91.33	22.30
Multilingual
MMLU-ProX (average over languages)	79.36	85.06	76.59
WMT24++ (en→xx)	86.67	87.84	88.89

It also demonstrates significant throughput gains, achieving up to 2.2x more throughput than gpt-oss-120B and 7.5x more than Qwen3.5-122B in high-volume settings.

Nvidia Nemotron 3 Super key benchmark chart. Credit: Nvidia

Custom ‘Open’ License: Commercial Use, but With Important Caveats

The release of Nemotron 3 Super under the Nvidia Open Model License Agreement (updated October 2025) provides a permissive framework for enterprise adoption, although it carries distinct “safeguard” clauses that set it apart from pure open-source licenses such as MIT or Apache 2.0.

Key provisions for enterprise users:

Commercial usability: The license explicitly states that the models are “commercially usable” and grants a perpetual, worldwide, royalty-free license to sell and distribute products built on the models.
Ownership of output: Nvidia makes no claim to the output generated by the model; responsibility for, and ownership of, those outputs rests entirely with the user.
Derivative works: Enterprises are free to create and own “derivative models” (fine-tuned versions), provided they include the required attribution notice: “Licensed by Nvidia Corporation under the Nvidia Open Model License.”

“Red Lines”:

The license includes two important termination triggers that production teams should monitor:

Safety guardrails: If a user bypasses or circumvents the model’s “guardrails” (technical limitations or safety hyperparameters) without implementing a “substantially similar” replacement appropriate for the use case, the license automatically terminates.
Litigation trigger: If a user files a copyright or patent lawsuit against Nvidia alleging that the model infringes their IP, their license to use the model terminates immediately.

This structure allows Nvidia to foster a commercial ecosystem while protecting itself from “IP trolling” and ensuring that the model’s safety features are not stripped out for malicious use.

‘The team really cooked’

The release has generated significant discussion within the developer community. Chris Alexiuk, a senior product research engineer at Nvidia, heralded the launch on X under his handle @llm_wizard as a “Super Day,” emphasizing the model's speed and transparency. “Model is: Fast. Model is: Smart. Model is: The most open model ever,” Chris posted, highlighting the release of not only weights but also 10 trillion tokens of training data and recipes.

Industry adoption reflects this enthusiasm:

Cloud and hardware: The model is being deployed as an Nvidia NIM microservice, allowing it to run on-premises via the Dell AI Factory or HPE, as well as on Google Cloud, Oracle and, soon, AWS and Azure.
Production agents: Companies like CodeRabbit (software development) and Greptile are integrating the model to handle large-scale codebase analysis, while industrial leaders like Siemens and Palantir are deploying it to automate complex workflows in manufacturing and cybersecurity.

As Kari Briski, Nvidia VP of AI software, said: “As companies move beyond chatbots to multi-agent applications, they face a context explosion.”

Nemotron 3 Super is Nvidia’s answer to that explosion: a model that delivers the “brainpower” of a 120B-parameter system with the operational efficiency of a much smaller specialist. For enterprises, the message is clear: the cost of “thinking” is finally coming down.
