June 2, 2026|Emerging Tech·AI Hardware

Nemotron 3 Ultra — NVIDIA's Open AI Model That Runs Without the Cloud

Rafael Zacheu

7 min read

Jensen Huang has said variations of the same sentence throughout his career, but it never sounded as concrete as this week in Taipei: "We built the computer that runs agents." The RTX Spark is the hardware proof. The Nemotron 3 Ultra is the software proof.

Released June 2, 2026, it is the most capable open-weights AI model built in the United States — 550 billion parameters, a 1-million-token context window, 300+ tokens per second throughput. But the benchmark position is only part of the story. The more important story is how NVIDIA got here over three years of architectural bets, and what it means to run a model of this scale locally, with zero per-token cost, on hardware you own.

From 8 Billion to 550 Billion: The Timeline

The Nemotron family did not start as a play for frontier models. It started as a tool — the moment NVIDIA realized that dominating AI hardware without having a voice in the model ecosystem was a vulnerable position.

November 2023 — Nemotron-3 8B: Eight billion parameters, built for enterprise chatbot development inside NVIDIA's NeMo framework. Not a frontier model. A declaration of intent.

February–June 2024 — Nemotron-4: The technical report for the 15B model introduced a multilingual architecture trained on 8 trillion tokens. In June, the 340B family arrived with a focus on synthetic data generation and instruction tuning. This phase is when NVIDIA learned to build training pipelines at scale — the infrastructure everything else would depend on.

January 2025 (CES 2025) — Llama Nemotron: Reasoning models built on Meta's Llama architecture with proprietary post-training focused on planning and chain-of-thought. The first demonstration that NVIDIA could take a third-party open base and make it competitive with closed frontier models.

December 2025 — Nemotron 3 Nano: The decisive architectural break. The family adopted a hybrid Mamba-Transformer architecture with latent Mixture-of-Experts — combining Mamba's efficient long-sequence processing with transformer reasoning power, activating only a fraction of parameters per token. Result: 4× higher throughput, 1-million-token context window, 60% reduction in reasoning tokens. Small model, proof-of-concept for everything that followed.

March 2026 — Nemotron 3 Super: 120B total / 12B active parameters. Designed to solve the "thinking tax" — the computational cost of using heavy reasoning models for every subtask in a multi-agent system. In agentic pipelines, each reasoning round can generate up to 15× more tokens than a simple conversation, causing context explosion and goal drift. The Super scored 85.6% on PinchBench, the best open-weights model in its class.

June 2026 (Computex / GTC Taipei) — Nemotron 3 Ultra: 550B total / 55B active per token. Available from June 4 on Hugging Face and via NVIDIA NIM. Weights, training data, and replication recipes published openly.

What the Model Does

Specification	Value
Total parameters	~550 billion
Active parameters per token	~55 billion (MoE)
Architecture	Hybrid Mamba-Transformer + Latent MoE
Training format	NVFP4 (4-bit, Blackwell-optimized)
Context window	1 million tokens
Throughput	300+ tokens/second
Speed vs. comparable models	5× faster than GLM
Cost per agentic task	30% lower than alternatives
Intelligence Index (Artificial Analysis)	48 — #1 US open-weights
Availability	Hugging Face + NVIDIA NIM (from June 4, 2026)

The throughput and cost numbers come from two architectural decisions. First, NVFP4 precision: training and inference at 4-bit specifically optimized for Blackwell hardware — the model fits in less memory and runs faster without meaningful accuracy loss. Second, latent MoE: instead of activating all 550B parameters for every token, only the 55B most relevant to the specific task are engaged. Overhead stays low; capability stays high.

The 1-million-token context window is not a benchmark stat. One million tokens holds approximately 750,000 words — a full codebase, a year of project email threads, a complete legal document archive. Maintaining that context across hours of autonomous work, without losing the thread, is a distinct capability from raw reasoning power. Nemotron 3 Ultra was designed not to forget.

AI Intelligence Index — open vs closed models, June 2026

Frontier closed models (Anthropic / Google / OpenAI)Proprietary

57 pts

Kimi K2.6 — Moonshot AI (China)Open-weights

54 pts

Nemotron 3 Ultra — NVIDIAOpen-weights · US #1

48 pts

Previous-gen open-weights (2024 avg)2024

40 pts

The gap to frontier has closed from roughly 20 points two years ago to 9 points today. Jensen Huang confirmed that Nemotron 4 is in development. The trajectory is the story.

The Three-Model Family

Nemotron 3 is not a single model — it is a three-tier architecture built for intelligent workload routing in agentic systems.

	Nano	Super	Ultra
Role	High-frequency tasks	Multi-agent orchestrator	Deep reasoning engine
Parameters	30B total / 3B active	120B total / 12B active	550B total / 55B active
Primary use	Classification, summarization, edge deployment	Agentic pipelines, low-latency coordination	Scientific research, complex coding, long-horizon planning
Key metric	4× higher throughput vs Nemotron 2 Nano	85.6% PinchBench — best open-weights in class	Intelligence Index 48 — #1 US open-weights
Available	Now	March 2026	June 4, 2026

The routing logic: simple, high-frequency tasks (parsing, classification, summarization) go to Nano. Multi-agent coordination and software development go to Super. Deep reasoning, long-document analysis, and complex planning go to Ultra. The system evaluates task complexity and routes accordingly — maintaining low cost on routine tasks while reserving maximum capacity for when it genuinely matters.

Why the RTX Spark Synergy Matters

The Nemotron 3 family and the RTX Spark were not designed in parallel by separate teams. They were co-optimized from the beginning, and the integration runs deeper than compatibility.

Memory match: The RTX Spark was built with 128GB of unified memory precisely because models at the Super tier (120B parameters) need that space to run locally without swap. The Ultra in NVFP4 quantized format fits entirely in that same memory pool. No current consumer GPU — including the RTX 5090 with its 32GB of VRAM — can do this.

Format match: Nemotron 3 Ultra uses NVFP4 — a precision format developed by NVIDIA for the Blackwell architecture. The model was not adapted for this hardware after the fact; it was trained to run on this GPU. The acceleration is co-optimization from the training phase, not a runtime patch.

The token cost math: An agent running at 300 tokens/second for 8 hours of autonomous work generates approximately 8.6 million tokens. Via a cloud API, at typical frontier model pricing, that costs hundreds of dollars per day. Locally, the marginal cost is zero. For small teams, independent researchers, or small businesses, this change in the economic model may be more transformative than any benchmark score.

Privacy by architecture: With the model running locally and weights installed on the machine, there is no API call, no external log, no network traffic during inference. For legal, medical, and financial professionals operating under GDPR or CCPA, this is compliance by architecture — not by contractual trust in a cloud provider.

Ecosystem breadth: The Ultra deploys via NVIDIA NIM, vLLM, SGLang, Ollama, or llama.cpp. RTX Spark has native CUDA. Any developer already working with open-weights models can run the Ultra on Spark without relearning tools or rewriting pipelines.

Two Angles

For: The Intelligence Finally Belongs to the People Using It

For three years, access to frontier AI came with a hidden cost: your data. Every query sent to GPT-4, to Claude, to Gemini, traveled through servers that logged and potentially used that content for purposes the user never saw. The Nemotron 3 Ultra changes that dynamic structurally, not cosmetically.

With published weights, published training data, and published replication recipes, any developer can audit, modify, fine-tune, and deploy the model inside their own infrastructure. In healthcare, law, defense, and scientific research, that is the minimum condition for AI to be legally usable in sensitive scenarios. Local execution and auditability satisfy GDPR and CCPA requirements at the architectural level. The prior Nemotron family has accumulated over 50 million downloads in the 12 months before this launch — the community infrastructure already exists.

Against: Open Weights Are Not Open Source — And the Difference Matters

There is an equivalence NVIDIA actively cultivates: between "open weights" and "open source." They are not the same. The Nemotron 3 Ultra is published under the NVIDIA Open Model License — more permissive than closed licenses, but more restrictive than Apache 2.0 or MIT. Redistribution and modification restrictions exist. They do not appear in keynotes.

The hardware dependency is the second issue. Nemotron 3 Ultra was trained in NVFP4, optimized for Blackwell, and deployed via NVIDIA NIM on CUDA infrastructure. You can technically run the model on third-party hardware — but the performance demonstrated in benchmarks is performance on NVIDIA hardware. A "free" model that only performs at its stated capability on one company's hardware is a different kind of lock-in, not an absence of it.

Then there is the gap with China. The Nemotron 3 Ultra scores 48 on the Intelligence Index. The Kimi K2.6 from Moonshot AI, launched two months earlier, scores 54. Ultra is the best US open-weights model — and still trails the best open-weights model from China by 6 points. The narrative of American open-weights leadership is accurate but incomplete without that context.

Finally: a 550-billion-parameter model with open weights, capable of long-horizon autonomous planning, running locally without any external monitoring layer — is, by design, a model that operates without the observability that centralized cloud deployments provide. The same property that makes it attractive for legitimate privacy use cases also makes misuse harder to detect. That is not a reason not to build it. It is a reason to be clear-eyed about what "open" entails.

What This Means Going Forward

The last four years of AI were defined by centralization: large models running in enormous data centers, accessed via API, with every interaction passing through third-party servers. What NVIDIA is building is the infrastructure for a decentralized alternative — not because centralized AI is technically inferior, but because there is growing, legitimate demand for intelligence that does not depend on connectivity, does not carry variable cost, and does not require trusting your data to a third party.

The relevant question three years from now will not be "which model do I use?" It will be "who controls the model I use?" And the answer to that question, more than any benchmark score, will determine whether this era of AI was broadly beneficial.

Nemotron 3 Ultra is available now on Hugging Face and via NVIDIA NIM from June 4, 2026. For the hardware designed to run it, see our RTX Spark breakdown.

Tags:

nemotron 3 ultranvidia open ai model 2026open weights llm 2026Emerging Tech

NVIDIA RTX Spark — The AI Superchip That Reinvents the Personal Computer

Surface Laptop Ultra — The Laptop Microsoft Never Had the Courage to Make