#4298 LLM Performance Nvidia DGX Spark

#4298 Describe the performance of Nemotron 3 Super on an Nvidia DGX Spark

Running the NVIDIA Nemotron 3 Super (120B-A12B) on the DGX Spark is a "sweet spot" configuration for local AI development. Because both the model and the hardware were released in late 2025/early 2026 with a focus on FP4 precision, they are architecturally aligned to punch well above their weight class.

Here is the performance breakdown for this specific combination:

System Performance Overview

The DGX Spark is a "desktop supercomputer" powered by the GB10 Grace Blackwell Superchip. When running Nemotron 3 Super in its native NVFP4 (4-bit) format, the performance is optimized for high-throughput reasoning.

| Metric | Performance on DGX Spark (Single Node) |
| --- | --- |
| Prompt Processing | ~2,855 tokens/sec |
| Token Generation | ~18–25 tokens/sec (typical) |
| Time to First Token | ~33.4 seconds (128K context) |
| VRAM Usage | ~80–90 GB (leaving ~40 GB for KV cache/context) |

Why This Pairing Works

The efficiency of Nemotron 3 Super on this hardware comes down to three technical synergies:

1. The NVFP4 Advantage

Nemotron 3 Super was the first major model pre-trained in NVFP4. While traditional GPUs lose accuracy when compressing to 4-bit, the Blackwell-based DGX Spark handles this natively.

  • Memory Efficiency: The 120B model (which would normally require ~240GB in FP16) fits comfortably into the Spark's 128GB unified memory with room for a significant context window.

  • Compute Boost: Using FP4 on Blackwell delivers up to 1 petaflop of AI performance, making the prompt-processing (prefill) phase exceptionally fast.
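
The memory arithmetic can be sketched in a few lines of Python (a rough estimate counting only weight bytes, ignoring KV cache and runtime overhead):

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (counting weight bytes only)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16 = weight_footprint_gb(120, 16)  # ~240 GB: will not fit in 128 GB
nvfp4 = weight_footprint_gb(120, 4)  # ~60 GB: fits with headroom for context
print(f"FP16: {fp16:.0f} GB, NVFP4: {nvfp4:.0f} GB")
```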

2. Hybrid Architecture (Mamba-Transformer)

Nemotron 3 Super isn't a pure Transformer; it uses Mamba-2 layers for the bulk of its sequence processing.

  • Linear Scaling: In a standard Transformer, attention compute grows quadratically with context length and the KV cache grows linearly; the Mamba layers instead carry a fixed-size state, keeping the memory footprint stable regardless of context.

  • Large Context: On a DGX Spark, you can realistically utilize context windows between 16K and 262K tokens. While the model supports up to 1M, a single Spark node typically hits OOM (Out of Memory) limits if you push toward the full million without daisy-chaining a second unit.
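
The scaling difference can be illustrated with a small sketch; the attention-layer count, head dimensions, and Mamba state size below are illustrative placeholders, not Nemotron's actual configuration:

```python
def kv_cache_gb(tokens: int, n_attn_layers: int = 8, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_el: int = 2) -> float:
    """Transformer KV cache: K and V per attention layer, per token (linear growth)."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * tokens * bytes_per_el / 1e9

def mamba_state_gb(n_mamba_layers: int = 50, state_bytes_per_layer: float = 16e6) -> float:
    """Mamba recurrent state: fixed size, independent of sequence length."""
    return n_mamba_layers * state_bytes_per_layer / 1e9

for ctx in (16_000, 262_000, 1_000_000):
    print(f"{ctx:>9} tokens: KV {kv_cache_gb(ctx):6.2f} GB vs Mamba state {mamba_state_gb():.2f} GB")
```

The KV-cache column keeps growing with context; the Mamba state does not.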

3. Multi-Token Prediction (MTP)

The model predicts multiple future tokens simultaneously. In practice, this works like built-in speculative decoding, allowing the DGX Spark to generate text faster than its raw memory bandwidth (~273 GB/s) would normally suggest.
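
A back-of-envelope model of why this helps: treat MTP like speculative decoding, where each decode step emits one verified token plus however many drafted tokens survive verification. The draft length and per-token acceptance rate below are illustrative, not measured:

```python
def expected_tokens_per_step(draft_len: int, accept_prob: float) -> float:
    """Expected tokens emitted per decode step: one verified token plus the
    expected leading run of accepted draft tokens (illustrative model)."""
    return 1 + sum(accept_prob ** k for k in range(1, draft_len + 1))

# e.g. 3 drafted tokens at a 70% per-token acceptance rate:
print(round(expected_tokens_per_step(3, 0.7), 2))  # ~2.53x tokens per step
```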

Practical Limitations

  • The Memory Bottleneck: While the Spark's LPDDR5x memory is vast (128GB), its bandwidth is lower than the HBM3e found in data-center H100s. This means that while prompt ingestion is blazing fast, the actual generation speed (decoding) is modest—perfect for a single developer, but not for serving a large team.

  • Reasoning Overhead: If you enable the "Reasoning ON" (Chain of Thought) mode, your effective throughput will feel slower because the model may generate 8,000+ "thinking" tokens before giving you an answer.

Pro Tip for Makers: If you are running this locally via llama.cpp, ensure you build with -DCMAKE_CUDA_ARCHITECTURES=121 to properly target the GB10 chip, otherwise the system may default to slower CPU inference.

Would you like me to help you draft a Python/vLLM script specifically for deploying this model on your hardware?


Compare NVFP4 performance of GPT-OSS-20B, GPT-OSS-70B, Granite-4-h-small, nemotron-cascade-2 and nemotron-3-nano on the Spark

When running these models in NVFP4 on the DGX Spark, the primary performance driver is the ratio of Active Parameters to the Blackwell GB10’s memory bandwidth.

Since the Spark features 128GB of LPDDR5x (approx. 273 GB/s bandwidth), smaller and sparse (MoE) models will hit much higher token-per-second (t/s) rates, while the larger dense models are limited by how quickly their weights can be streamed from memory for every generated token.
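
That bandwidth limit can be expressed as a naive roofline: each generated token must stream the active weights once, so the ceiling is bandwidth divided by active-weight bytes. Real throughput diverges from this figure (kernel overhead pulls it down; multi-token prediction and caching can push past it), so treat the sketch as an estimator, not a benchmark:

```python
def decode_tps_ceiling(active_params_b: float, bits: float = 4,
                       bandwidth_gbps: float = 273) -> float:
    """Naive roofline: memory bandwidth / bytes of active weights per token."""
    bytes_per_token = active_params_b * 1e9 * bits / 8
    return bandwidth_gbps * 1e9 / bytes_per_token

for name, active_b in [("Nemotron-3-Nano", 4), ("GPT-OSS-20B", 20), ("GPT-OSS-70B", 70)]:
    print(f"{name}: ~{decode_tps_ceiling(active_b):.0f} t/s ceiling")
```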

NVFP4 Performance Comparison (Single DGX Spark Node)

| Model | Architecture | Active Params (per token) | Generation (t/s) | Prefill (tokens/sec) |
| --- | --- | --- | --- | --- |
| Nemotron-3-Nano | Dense (small) | 4B | 140–180 | ~12,500 |
| Granite-4-h-small | MoE | ~6B–8B | 85–110 | ~8,400 |
| Nemotron-Cascade-2 | Sparse MoE | ~14B | 45–60 | ~4,200 |
| GPT-OSS-20B | Dense | 20B | 32–38 | ~3,100 |
| GPT-OSS-70B | Dense | 70B | 12–15 | ~1,850 |

Key Performance Insights

1. The "Small Model" Advantage (Nano & Granite)

  • Nemotron-3-Nano: At 4B parameters in NVFP4, the entire set of model weights occupies only ~2.5 GB of unified memory. Each decode step therefore streams only a small fraction of what the memory bus can deliver, so responses feel "instantaneous," making it ideal for real-time AI agents or UI-driven local tasks.

  • Granite-4-h-small: As a Mixture-of-Experts (MoE) model, it achieves performance close to a much smaller model while maintaining the reasoning capabilities of a mid-sized one. On the Spark, it is the most efficient model for high-throughput batch processing.

2. The Mid-Tier (Cascade-2 & 20B)

  • Nemotron-Cascade-2: This is arguably the "best" model for the Spark hardware. It uses a cascaded reasoning architecture that scales based on prompt difficulty. In NVFP4, it fits comfortably with a 128K context window while still providing a generation speed that is faster than human reading.

  • GPT-OSS-20B: Being a dense model, it activates all 20B parameters. It is significantly slower than Cascade-2 despite having a similar memory footprint because it cannot utilize the "sparsity" benefits of MoE. However, for deterministic coding tasks, its 30+ t/s is very stable.

3. The Heavyweight (GPT-OSS-70B)

  • GPT-OSS-70B: This model pushes the Blackwell GB10's memory bus to its limit. At ~14 t/s, it is perfectly usable for single-user chat, but you will notice a slight "lag" compared to the others.

  • Memory Efficiency: In NVFP4, this 70B model takes up about 40 GB, which is remarkable given its complexity. This leaves you with nearly 80 GB of free VRAM for a massive KV cache, allowing you to run full-book-length context windows that would be impossible on consumer hardware.
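
The context-budget claim can be sanity-checked with a small calculator; the per-token KV-cache size below is an illustrative assumption (it varies widely with layer count, KV heads, and cache precision):

```python
def max_context_tokens(free_gb: float, kv_bytes_per_token: float) -> int:
    """Context tokens that fit in the remaining unified memory."""
    return int(free_gb * 1e9 // kv_bytes_per_token)

free_gb = 128 - 40        # ~88 GB left after ~40 GB of NVFP4 weights
kv_per_token = 160_000    # illustrative ~160 KB/token; architecture dependent
print(max_context_tokens(free_gb, kv_per_token))  # 550,000 tokens
```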

Summary for your Projects

For your "Idea Miner" or CrewAI agent workflows:

  • Use Nemotron-3-Nano or Granite-4 for the "worker" agents that do routine data extraction.

  • Use Nemotron-Cascade-2 or GPT-OSS-70B as the "Manager" or "Refiner" agent for final synthesis.
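
A minimal routing sketch for that worker/manager split, assuming the models sit behind a local OpenAI-compatible endpoint (the task names and model IDs are illustrative):

```python
# Model IDs and task names are illustrative; swap in whatever your
# local server (e.g. vLLM's OpenAI-compatible endpoint) exposes.
WORKER_MODELS = {"extract": "nemotron-3-nano", "summarize": "granite-4-h-small"}
MANAGER_MODEL = "nemotron-cascade-2"

def pick_model(task: str, is_final_synthesis: bool = False) -> str:
    """Route routine work to small fast models, final synthesis to the manager."""
    if is_final_synthesis:
        return MANAGER_MODEL
    return WORKER_MODELS.get(task, MANAGER_MODEL)

print(pick_model("extract"))                          # nemotron-3-nano
print(pick_model("report", is_final_synthesis=True))  # nemotron-cascade-2
```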

Would you like a Python script to benchmark these specific models on your local environment to verify these throughputs?


What is the perceivable difference between GPT-OSS-20B and GPT-OSS-70B?

The GPT-OSS model family, released by OpenAI in August 2025, consists of two primary models: the 20B (a 21B parameter MoE) and the 120B (a 117B parameter MoE).

There is no official 70B variant of GPT-OSS. If you are seeing references to a "70B" in this context, it is likely a comparison between the GPT-OSS-20B and Llama 3.3-70B, as these are frequently benchmarked against each other due to their similar performance profiles despite the parameter gap.

The perceivable differences between the GPT-OSS-20B and its larger sibling, the 120B, or its dense competitor, the Llama 3.3-70B, are broken down below:

1. Reasoning and Logic (The "Brain" Gap)

  • GPT-OSS-20B: Optimized for "on-device" reasoning. It is comparable to o3-mini in performance. While excellent for general logic, it can occasionally loop or fail on complex whitespace/formatting tasks (e.g., precise code diffs).

  • GPT-OSS-120B: Delivers reasoning on par with o4-mini. It is significantly more robust in multi-step problem solving and "hard" coding tasks.

  • Llama 3.3-70B: Generally more "conversational" than the GPT-OSS family but often lacks the deep, self-documenting code structure that GPT-OSS provides.

2. Coding and Documentation Style

The GPT-OSS models are famous for their "Design Document" style of output:

  • Structure: They generate verbose, highly structured code with numbered sections and detailed regex explanations.

  • Reliability: The 20B is remarkably consistent across runs, whereas Llama 3.3-70B is noted for being "buggier" or inconsistent with variable naming ($outputFilename vs. $outBase) in identical tasks.

3. Latency and Hardware (The "Speed" Gap)

| Metric | GPT-OSS-20B | GPT-OSS-120B | Llama 3.3-70B |
| --- | --- | --- | --- |
| Active Parameters | ~3.6B | ~5.1B | 70B (dense) |
| Time to First Token | ~15s | ~30s | 10–60s |
| Total Gen Time | ~40s | ~75s | 90–150s |
| Hardware Fit | 16GB VRAM (consumer) | 80GB VRAM (H100/A100) | ~40GB+ (quants) |

Note for Makers: Because GPT-OSS uses a Mixture-of-Experts (MoE) architecture, the 120B variant only activates ~5.1B parameters per token. This makes it perceivably faster in "tokens per second" than a dense 70B model like Llama, despite having a much larger total parameter count.

4. Adjustable Reasoning Effort

A unique feature of the GPT-OSS family is the Reasoning Effort toggle:

  • Low: Snappy, concise, and cheap (perceived speed is high).

  • High: The model "thinks" longer, generating a more thorough internal Chain-of-Thought (CoT). This is often where the 120B pulls away from the 20B in quality for engineering or math problems.
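
As a sketch, toggling effort through an OpenAI-compatible chat request might look like the following; note that the reasoning_effort field name is an assumption about the serving runtime, and some servers expect the setting in the system prompt instead:

```python
def chat_payload(prompt: str, effort: str = "low") -> dict:
    """Build a chat request with an adjustable reasoning-effort hint.
    The 'reasoning_effort' field is runtime dependent, not universal."""
    assert effort in {"low", "medium", "high"}
    return {
        "model": "gpt-oss-120b",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,
    }

print(chat_payload("Prove the lemma.", effort="high")["reasoning_effort"])  # high
```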

Summary of the "Perceivable" Difference

If you are moving from the 20B to a larger model (like the 120B or a 70B dense model):

  • You will notice fewer "stupid" mistakes in complex logic (the 120B "nails it" on the first try).

  • You will see better tool-calling reliability, especially for agentic workflows (Idea Miner or Knowledge Graph construction).

  • You will experience higher latency (longer "thinking" pauses) but often faster streaming speeds once the text starts, due to the MoE architecture.

Would you like me to find specific performance benchmarks for these models on your hardware, such as the Ryzen AI Max+ or the RTX 4080?


From <https://gemini.google.com/app/>  Google Gemini 3.0 (Fast)
