
Nemotron 3 Ultra — NVIDIA's Open AI Model That Runs Without the Cloud
Rafael Zacheu
7 min read
Jensen Huang has said variations of the same sentence throughout his career, but it never sounded as concrete as this week in Taipei: "We built the computer that runs agents." The RTX Spark is the hardware proof. The Nemotron 3 Ultra is the software proof.
Released June 2, 2026, it is the most capable open-weights AI model built in the United States — 550 billion parameters, a 1-million-token context window, 300+ tokens per second throughput. But the benchmark position is only part of the story. The more important story is how NVIDIA got here over three years of architectural bets, and what it means to run a model of this scale locally, with zero per-token cost, on hardware you own.
From 8 Billion to 550 Billion: The Timeline
The Nemotron family did not start as a play for frontier models. It started as a tool — the moment NVIDIA realized that dominating AI hardware without having a voice in the model ecosystem was a vulnerable position.
November 2023 — Nemotron-3 8B: Eight billion parameters, built for enterprise chatbot development inside NVIDIA's NeMo framework. Not a frontier model. A declaration of intent.
February–June 2024 — Nemotron-4: The technical report for the 15B model introduced a multilingual architecture trained on 8 trillion tokens. In June, the 340B family arrived with a focus on synthetic data generation and instruction tuning. This phase is when NVIDIA learned to build training pipelines at scale — the infrastructure everything else would depend on.
January 2025 (CES 2025) — Llama Nemotron: Reasoning models built on Meta's Llama architecture with proprietary post-training focused on planning and chain-of-thought. The first demonstration that NVIDIA could take a third-party open base and make it competitive with closed frontier models.
December 2025 — Nemotron 3 Nano: The decisive architectural break. The family adopted a hybrid Mamba-Transformer architecture with latent Mixture-of-Experts — combining Mamba's efficient long-sequence processing with transformer reasoning power, activating only a fraction of parameters per token. Result: 4× higher throughput, 1-million-token context window, 60% reduction in reasoning tokens. Small model, proof-of-concept for everything that followed.
March 2026 — Nemotron 3 Super: 120B total / 12B active parameters. Designed to solve the "thinking tax" — the computational cost of using heavy reasoning models for every subtask in a multi-agent system. In agentic pipelines, each reasoning round can generate up to 15× more tokens than a simple conversation, causing context explosion and goal drift. The Super scored 85.6% on PinchBench, the best open-weights model in its class.
June 2026 (Computex / GTC Taipei) — Nemotron 3 Ultra: 550B total / 55B active per token. Available from June 4 on Hugging Face and via NVIDIA NIM. Weights, training data, and replication recipes published openly.
What the Model Does
| Specification | Value |
|---|---|
| Total parameters | ~550 billion |
| Active parameters per token | ~55 billion (MoE) |
| Architecture | Hybrid Mamba-Transformer + Latent MoE |
| Training format | NVFP4 (4-bit, Blackwell-optimized) |
| Context window | 1 million tokens |
| Throughput | 300+ tokens/second |
| Speed vs. comparable models | 5× faster than GLM |
| Cost per agentic task | 30% lower than alternatives |
| Intelligence Index (Artificial Analysis) | 48 — #1 US open-weights |
| Availability | Hugging Face + NVIDIA NIM (from June 4, 2026) |
The throughput and cost numbers come from two architectural decisions. First, NVFP4 precision: training and inference at 4-bit specifically optimized for Blackwell hardware — the model fits in less memory and runs faster without meaningful accuracy loss. Second, latent MoE: instead of activating all 550B parameters for every token, only the 55B most relevant to the specific task are engaged. Overhead stays low; capability stays high.
The 1-million-token context window is not a benchmark stat. One million tokens holds approximately 750,000 words — a full codebase, a year of project email threads, a complete legal document archive. Maintaining that context across hours of autonomous work, without losing the thread, is a distinct capability from raw reasoning power. Nemotron 3 Ultra was designed not to forget.
AI Intelligence Index — open vs closed models, June 2026
The gap to frontier has closed from roughly 20 points two years ago to 9 points today. Jensen Huang confirmed that Nemotron 4 is in development. The trajectory is the story.
The Three-Model Family
Nemotron 3 is not a single model — it is a three-tier architecture built for intelligent workload routing in agentic systems.
| Nano | Super | Ultra | |
|---|---|---|---|
| Role | High-frequency tasks | Multi-agent orchestrator | Deep reasoning engine |
| Parameters | 30B total / 3B active | 120B total / 12B active | 550B total / 55B active |
| Primary use | Classification, summarization, edge deployment | Agentic pipelines, low-latency coordination | Scientific research, complex coding, long-horizon planning |
| Key metric | 4× higher throughput vs Nemotron 2 Nano | 85.6% PinchBench — best open-weights in class | Intelligence Index 48 — #1 US open-weights |
| Available | Now | March 2026 | June 4, 2026 |
The routing logic: simple, high-frequency tasks (parsing, classification, summarization) go to Nano. Multi-agent coordination and software development go to Super. Deep reasoning, long-document analysis, and complex planning go to Ultra. The system evaluates task complexity and routes accordingly — maintaining low cost on routine tasks while reserving maximum capacity for when it genuinely matters.
Why the RTX Spark Synergy Matters
The Nemotron 3 family and the RTX Spark were not designed in parallel by separate teams. They were co-optimized from the beginning, and the integration runs deeper than compatibility.
Memory match: The RTX Spark was built with 128GB of unified memory precisely because models at the Super tier (120B parameters) need that space to run locally without swap. The Ultra in NVFP4 quantized format fits entirely in that same memory pool. No current consumer GPU — including the RTX 5090 with its 32GB of VRAM — can do this.
Format match: Nemotron 3 Ultra uses NVFP4 — a precision format developed by NVIDIA for the Blackwell architecture. The model was not adapted for this hardware after the fact; it was trained to run on this GPU. The acceleration is co-optimization from the training phase, not a runtime patch.
The token cost math: An agent running at 300 tokens/second for 8 hours of autonomous work generates approximately 8.6 million tokens. Via a cloud API, at typical frontier model pricing, that costs hundreds of dollars per day. Locally, the marginal cost is zero. For small teams, independent researchers, or small businesses, this change in the economic model may be more transformative than any benchmark score.
Privacy by architecture: With the model running locally and weights installed on the machine, there is no API call, no external log, no network traffic during inference. For legal, medical, and financial professionals operating under GDPR or CCPA, this is compliance by architecture — not by contractual trust in a cloud provider.
Ecosystem breadth: The Ultra deploys via NVIDIA NIM, vLLM, SGLang, Ollama, or llama.cpp. RTX Spark has native CUDA. Any developer already working with open-weights models can run the Ultra on Spark without relearning tools or rewriting pipelines.
Two Angles
For: The Intelligence Finally Belongs to the People Using It
For three years, access to frontier AI came with a hidden cost: your data. Every query sent to GPT-4, to Claude, to Gemini, traveled through servers that logged and potentially used that content for purposes the user never saw. The Nemotron 3 Ultra changes that dynamic structurally, not cosmetically.
With published weights, published training data, and published replication recipes, any developer can audit, modify, fine-tune, and deploy the model inside their own infrastructure. In healthcare, law, defense, and scientific research, that is the minimum condition for AI to be legally usable in sensitive scenarios. Local execution and auditability satisfy GDPR and CCPA requirements at the architectural level. The prior Nemotron family has accumulated over 50 million downloads in the 12 months before this launch — the community infrastructure already exists.
Against: Open Weights Are Not Open Source — And the Difference Matters
There is an equivalence NVIDIA actively cultivates: between "open weights" and "open source." They are not the same. The Nemotron 3 Ultra is published under the NVIDIA Open Model License — more permissive than closed licenses, but more restrictive than Apache 2.0 or MIT. Redistribution and modification restrictions exist. They do not appear in keynotes.
The hardware dependency is the second issue. Nemotron 3 Ultra was trained in NVFP4, optimized for Blackwell, and deployed via NVIDIA NIM on CUDA infrastructure. You can technically run the model on third-party hardware — but the performance demonstrated in benchmarks is performance on NVIDIA hardware. A "free" model that only performs at its stated capability on one company's hardware is a different kind of lock-in, not an absence of it.
Then there is the gap with China. The Nemotron 3 Ultra scores 48 on the Intelligence Index. The Kimi K2.6 from Moonshot AI, launched two months earlier, scores 54. Ultra is the best US open-weights model — and still trails the best open-weights model from China by 6 points. The narrative of American open-weights leadership is accurate but incomplete without that context.
Finally: a 550-billion-parameter model with open weights, capable of long-horizon autonomous planning, running locally without any external monitoring layer — is, by design, a model that operates without the observability that centralized cloud deployments provide. The same property that makes it attractive for legitimate privacy use cases also makes misuse harder to detect. That is not a reason not to build it. It is a reason to be clear-eyed about what "open" entails.
What This Means Going Forward
The last four years of AI were defined by centralization: large models running in enormous data centers, accessed via API, with every interaction passing through third-party servers. What NVIDIA is building is the infrastructure for a decentralized alternative — not because centralized AI is technically inferior, but because there is growing, legitimate demand for intelligence that does not depend on connectivity, does not carry variable cost, and does not require trusting your data to a third party.
The relevant question three years from now will not be "which model do I use?" It will be "who controls the model I use?" And the answer to that question, more than any benchmark score, will determine whether this era of AI was broadly beneficial.
Nemotron 3 Ultra is available now on Hugging Face and via NVIDIA NIM from June 4, 2026. For the hardware designed to run it, see our RTX Spark breakdown.
Tags: