#4270 MSI DGX Spark
Describe the performance of Granite-4.0-H-Small on a desktop RTX 5090
The Granite-4.0-H-Small model on an RTX 5090 represents one of the most efficient "powerhouse" setups for local AI in 2026. Because Granite 4.0 uses a hybrid architecture (Mamba-2 + Transformer) and a Mixture-of-Experts (MoE) design, it punches significantly above its weight on consumer hardware.
## Performance Overview
The RTX 5090's 32GB of GDDR7 memory and massive bandwidth (1.79 TB/s) perfectly complement Granite's hybrid architecture, which is designed to minimize VRAM usage and maximize throughput.
| Metric | Estimated Performance (RTX 5090) |
| --- | --- |
| Inference speed | ~60–80+ tokens/sec (4-bit Q4_K_M quantization) |
| VRAM usage | ~18–22 GB (leaving room for 128k+ context) |
| Context handling | Very little "context rot" or speed degradation, thanks to Mamba-2 |
| Throughput | Handles large parallel batches (~5,000+ total tokens/sec) |
### 1. Architectural Synergy
Active vs. Total Parameters: Granite-4.0-H-Small has 32B total parameters but only 9B active parameters per token. On a 5090, this means you get the reasoning depth of a 32B model with the speed of a 9B model.
Mamba-2 Hybrid: Unlike standard Transformers (like Llama 3), which slow down as the conversation gets longer, Granite's hybrid SSM architecture maintains nearly constant speed even as you approach its 128k (or extended) context limit.
Memory Efficiency: The RTX 5090's 32GB of VRAM is the sweet spot. A 32B model would normally crowd a 24GB card (like the 4090) at long context, but the 5090 handles it comfortably with 10GB+ of headroom for KV cache or background tasks.
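The speedup from sparse activation can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes decode is purely memory-bandwidth-bound (each generated token reads every active parameter from VRAM once); the bytes-per-weight figure for Q4_K_M is an approximation, not an official number.

```python
# Bandwidth-bound decode ceiling for Granite-4.0-H-Small on an RTX 5090.
# Assumption: every active parameter is read from VRAM once per generated token.

BANDWIDTH_BYTES_S = 1.79e12   # RTX 5090 GDDR7, ~1.79 TB/s
ACTIVE_PARAMS = 9e9           # active parameters per token (of 32B total)
BYTES_PER_WEIGHT = 0.57       # ~4.5 bits/weight for Q4_K_M incl. scales (approx.)

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_WEIGHT
ceiling_tps = BANDWIDTH_BYTES_S / bytes_per_token
print(f"Bandwidth ceiling: ~{ceiling_tps:.0f} tokens/sec")
```

The ceiling works out to a few hundred tokens/sec, so the observed 60–80 t/s sits comfortably under it once attention, kernel launch overhead, and sampling are accounted for. With all 32B parameters active, the ceiling would be roughly 3.5x lower, which is the intuition behind "depth of a 32B, speed of a 9B."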
### 2. Benchmark Expectations
In local environments (using tools like Ollama or LM Studio), you can expect:
Coding & Logic: Strong performance in Python and agentic tasks. It rivals models twice its size (like 70B variants) in instruction following (IFEval) while running 3x faster.
Zero-Lag Interaction: At ~80 t/s, the model generates text faster than a human can read, making it ideal for real-time RAG (Retrieval-Augmented Generation) where it has to "read" large documents before answering.
### 3. Hardware Considerations
Power & Heat: Running this model will push the 5090's 575W TDP. Ensure your PSU is rated for at least 1000W, since MoE models can produce rapid, "spiky" power draws as active experts switch.
Quantization: For the best balance of quality and speed, Q4_K_M or Q5_K_M GGUF formats are recommended. These keep the entire model on the GPU, avoiding the severe slowdown of offloading to system RAM.
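The VRAM figures quoted above can be cross-checked with simple arithmetic. This is a sketch: the effective bits-per-weight of Q4_K_M is an approximation.

```python
# Estimated weight footprint of a 32B-parameter model at Q4_K_M on a 32 GB card.
TOTAL_PARAMS = 32e9
BYTES_PER_WEIGHT = 0.57          # ~4.5 bits/weight for Q4_K_M (approx.)
VRAM_GB = 32                     # RTX 5090

weights_gb = TOTAL_PARAMS * BYTES_PER_WEIGHT / 1e9
headroom_gb = VRAM_GB - weights_gb
print(f"Weights: ~{weights_gb:.1f} GB, headroom: ~{headroom_gb:.1f} GB")
```

This lands near the low end of the ~18–22 GB range in the table, leaving 10 GB+ for KV cache, which is why the model stays fully on-GPU at long context.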
Would you like me to find the specific commands to run Granite-4.0-H-Small via Ollama or llama.cpp for your setup?
Repeat for a DGX Spark with 128 GB RAM available to CPU and GPU
Running Granite-4.0-H-Small on the NVIDIA DGX Spark changes the performance profile from "raw speed" to "massive capacity." While the RTX 5090 wins on pure tokens per second, the DGX Spark’s 128GB of Coherent Unified Memory (shared between the Grace CPU and Blackwell GPU) allows for workflows that are physically impossible on a single consumer card.
## Performance Overview
The DGX Spark is powered by the GB10 Grace Blackwell Superchip. In this setup, the 128GB LPDDR5x RAM acts as a single pool for both the CPU and GPU, connected via NVLink-C2C.
| Metric | Estimated Performance (DGX Spark) |
| --- | --- |
| Inference speed | ~40–50 tokens/sec (4-bit Q4_K_M) |
| Unified memory usage | ~18–21 GB (leaving ~100 GB for context) |
| Effective context | Full 128k context with large KV-cache headroom |
| Memory bandwidth | 273 GB/s (unified) vs. 1.79 TB/s (RTX 5090) |
### 1. Key Advantages over RTX 5090
Massive Context Windows: While the 5090 is faster at generating text, its 32GB VRAM limits how much "memory" the model can have during a long chat. On the DGX Spark, you can load Granite-4.0-H-Small and still have 100GB of RAM remaining to store a massive context window (128k tokens and beyond) or large RAG document caches without needing to swap data to a slower SSD.
Coherent Unified Memory: Unlike a PC where data must travel over the PCIe bus from CPU RAM to GPU VRAM, the GB10 chip accesses all 128GB at the same speed. This makes the "prefill" stage (when the model reads your long prompt) much more efficient for massive inputs.
Multi-Model Hosting: You can keep Granite-4.0-H-Small loaded alongside other models (like an embedding model for RAG or a smaller "Micro" model for tool-calling) simultaneously without running out of memory.
### 2. The Trade-off: Bandwidth vs. Capacity
Lower Tokens/Sec: The DGX Spark uses LPDDR5x memory with a bandwidth of 273 GB/s. This is significantly slower than the 5090’s GDDR7 (1.79 TB/s). You will see a drop from ~80 t/s down to roughly 40–50 t/s. However, 45 t/s is still much faster than human reading speed (~5–10 t/s), so it remains highly usable.
Silent & Low Power: The DGX Spark is a 240W system designed to be "silent" on a desk. Compared to a 5090 rig which can pull 600W+ and requires heavy cooling, the Spark is a much more efficient dedicated "AI appliance."
### 3. Specialized Use Cases
Agentic Workflows: Because Granite-4.0-H-Small is optimized for tool-calling and agents, the DGX Spark is the superior choice for long-running autonomous agents. The extra 100GB of RAM allows the agent to keep a vast history of its actions and observations "active" in memory.
Local RAG Lab: If you are feeding the model entire textbooks or codebases, the unified memory architecture handles the "needle-in-a-haystack" retrieval tasks with more stability than a consumer card.
### Summary Comparison
RTX 5090: Best for speed and single-turn low-latency chat.
DGX Spark: Best for memory, long-context research, and running multiple AI agents at once.
Would you like me to compare the specific power draw and thermal footprints of these two setups for a home office environment?
Describe the performance of Nvidia Nemotron 3 Nano 30B A3B and Nemotron 3 Super 120B on a DGX Spark
Moving to the Nvidia Nemotron 3 family on the DGX Spark shifts the focus toward high-density reasoning and agentic workflows. Unlike the Granite models, Nemotron 3 is specifically engineered to exploit the Blackwell (GB10) architecture and its native NVFP4 (4-bit Floating Point) support.
On a DGX Spark with 128 GB of unified memory, these models perform as follows:
## 1. Nemotron 3 Nano (30B A3B)
The "Nano" is a hybrid Mamba-2/Transformer Mixture-of-Experts (MoE) model. Despite the "Nano" name, it has 30B total parameters, but only 3.2B–3.5B are active per token.
Inference Speed: ~55–75 tokens/sec (NVFP4).
Memory Footprint: ~18–22 GB for weights.
The "Agent" Advantage: On the DGX Spark, you can easily fit the model plus a 1-million-token context window. Because Mamba layers have a fixed-size state, the "memory tax" for long conversations is significantly lower than standard Transformers.
Reasoning Mode: It features a native `<think>` mode. On the Spark, it can generate reasoning traces at high speed before producing a final answer, making it feel like a "mini-o1" on your desktop.
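The "fixed-size state" claim is worth quantifying. The sketch below shows how a pure-Transformer KV cache grows with context length, which is the cost a Mamba-style constant-size state avoids; the layer count and head dimensions are illustrative assumptions, not Nemotron's published configuration.

```python
# KV-cache growth for a hypothetical pure-Transformer stack.
# Layer/head numbers are illustrative assumptions, not the actual Nemotron config.

def transformer_kv_gb(ctx_len, n_layers=48, n_kv_heads=8, head_dim=128,
                      bytes_per_elem=2):
    # K and V tensors per layer, each of shape [ctx_len, n_kv_heads, head_dim]
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem / 1e9

for ctx in (8_000, 128_000, 1_000_000):
    print(f"{ctx:>9,} tokens -> {transformer_kv_gb(ctx):6.1f} GB of KV cache")
```

At 1M tokens, the pure-Transformer cache alone (~200 GB under these assumptions) would overflow the Spark's 128 GB. Hybrid models sidestep this because their Mamba layers keep a constant-size state regardless of context length, so only the minority Transformer layers pay the linear cost.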
## 2. Nemotron 3 Super (120B A12B)
This is the flagship for the DGX Spark. It uses a LatentMoE architecture with 120B total parameters, but only 12B are active per token. This is the largest model the Spark can run comfortably with a significant context buffer.
| Metric | Performance on DGX Spark (Unified Memory) |
| --- | --- |
| Quantization | NVFP4 (native) — optimized specifically for the GB10 chip |
| Inference speed | ~15–25 tokens/sec (reasoning OFF) / ~8–12 tokens/sec (reasoning ON) |
| Memory usage | ~87 GB (weights) + ~15–20 GB for KV cache/system |
| Context limit | Practical limit of ~128k–256k tokens (due to the 128GB RAM ceiling) |
### Key Technical Insights
NVFP4 Native Training: Unlike other models that are "squeezed" into 4-bit after the fact, Nemotron 3 Super was trained in NVFP4. On the DGX Spark, this means you get near-BF16 accuracy at 4-bit memory costs.
Multi-Token Prediction (MTP): The Super model predicts multiple tokens at once. On the Spark's unified memory, this helps offset the lower bandwidth (273 GB/s) compared to a dedicated GPU, effectively "batching" the generation to feel smoother.
The 128GB Ceiling: Loading the 120B model in 4-bit takes up ~87 GB. While the model supports 1M context, you will likely hit an "Out of Memory" (OOM) error on a single Spark if you push past 256k tokens while running heavy agentic tasks.
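The ~256k practical limit follows from simple headroom arithmetic. This is a sketch: the per-token cache cost is an assumed round figure for a hybrid MoE of this size, not a measured value.

```python
# How far can context grow before hitting the 128 GB ceiling?
TOTAL_GB = 128
WEIGHTS_GB = 87                  # Nemotron 3 Super at 4-bit (from the table)
SYSTEM_GB = 15                   # OS, runtime, and agent scratch space (assumed)
KV_MB_PER_TOKEN = 0.1            # assumed cache/state cost per context token

headroom_gb = TOTAL_GB - WEIGHTS_GB - SYSTEM_GB
max_ctx = headroom_gb * 1000 / KV_MB_PER_TOKEN
print(f"Headroom: ~{headroom_gb} GB -> practical context ~{max_ctx:,.0f} tokens")
```

Under these assumptions the budget runs out around a quarter-million tokens, which is why pushing past ~256k while also running heavy agentic tasks risks an OOM on a single Spark.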
## Summary: Performance Comparison
| Feature | Nemotron 3 Nano (30B) | Nemotron 3 Super (120B) |
| --- | --- | --- |
| Primary use | Fast, real-time tool calling | Deep reasoning & complex planning |
| Spark experience | Instantaneous, "chatty" speed | Methodical, high-intelligence output |
| Capacity | Can run 3–4 instances at once | Occupies ~75% of the Spark's resources |
Pro Tip: To get the best performance from the Super 120B on your Spark, use the vLLM or TensorRT-LLM backends with the `VLLM_USE_FLASHINFER_MOE_FP4=1` environment variable set.
Would you like the specific Docker run commands to launch these Nemotron models on your DGX Spark?
How about the 120B version of GPT-OSS?
The GPT-OSS 120B (also known as gpt-oss-120b) is a landmark release from OpenAI’s "return to open source" in late 2025. On the DGX Spark, it is arguably the most capable "frontier" model you can run locally, rivaling the reasoning of proprietary models like o4-mini.
Because it is a Mixture-of-Experts (MoE) model with a high degree of sparsity (117B total parameters but only ~5.1B active), it feels significantly faster than its size suggests.
## Performance on DGX Spark
The Spark's 128GB of unified memory is the "ideal" home for this model, allowing it to run at its native MXFP4 (4-bit) precision with a massive context buffer.
| Metric | Estimated Performance (DGX Spark) |
| --- | --- |
| Quantization | MXFP4 (native) — only the experts are quantized to 4-bit |
| Inference speed | ~55–60 tokens/sec (vLLM / SGLang optimized) |
| Memory usage | ~62 GB (weights) + ~30–40 GB for KV cache |
| Context limit | Comfortably handles 128k context with room to spare |
| Throughput | High parallel throughput (~80+ tokens/sec in multi-prompt batches) |
### 1. The Efficiency "Cheat Code"
The standout feature of GPT-OSS 120B is its Active-to-Total parameter ratio.
Sparsity: While it has 117B parameters worth of "knowledge," the DGX Spark only has to compute ~5B parameters for every word it generates.
Speed vs. Size: On the Spark, this makes it nearly as fast as the Nemotron 3 Nano (30B) despite having 4x more total parameters.
Latency: It features a very low Time-to-First-Token (TTFT), usually under 0.2s on the Spark's Grace Blackwell silicon.
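The sparsity arithmetic behind these claims is easy to reproduce. This is a sketch: the effective bits-per-weight is an approximation chosen to reconcile MXFP4 experts with the higher-precision attention and shared layers.

```python
# GPT-OSS 120B sparsity and weight footprint, back-of-the-envelope.
TOTAL_PARAMS = 117e9
ACTIVE_PARAMS = 5.1e9
BYTES_PER_WEIGHT = 0.53   # ~4.2 effective bits/weight: MXFP4 experts plus
                          # higher-precision non-expert layers (approx.)

sparsity = TOTAL_PARAMS / ACTIVE_PARAMS             # ~23x
weights_gb = TOTAL_PARAMS * BYTES_PER_WEIGHT / 1e9  # ~62 GB
print(f"Sparsity: ~{sparsity:.0f}x, weights: ~{weights_gb:.0f} GB")
```

A ~23x active-to-total ratio is why a 117B model decodes at roughly the speed of a dense model in the 5B class, and the ~62 GB weight figure is what makes the Spark's 128 GB pool the natural home for it.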
### 2. Reasoning & Effort Modes
Unlike the Granite models, GPT-OSS 120B supports configurable reasoning effort (Low, Medium, High).
Low/Medium: Operates like a standard fast LLM.
High Reasoning: The model will "pause" to generate internal chain-of-thought tokens. On the Spark, these hidden reasoning tokens generate at ~70+ t/s, meaning even a deep "thought process" only adds a couple of seconds to the total response time.
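The latency cost of hidden reasoning is easy to estimate. The trace length below is an assumption; the ~70 t/s figure is the estimate from the paragraph above.

```python
# Extra wall-clock time added by a hidden chain-of-thought trace.
HIDDEN_TOKENS = 200       # assumed length of a short internal reasoning trace
REASONING_TPS = 70        # hidden-token generation speed on the Spark (estimate)

extra_seconds = HIDDEN_TOKENS / REASONING_TPS
print(f"Reasoning overhead: ~{extra_seconds:.1f} s")
```

Even a trace several times longer stays in the tens-of-seconds range, which is why "High" effort feels like a deliberate pause rather than a stall.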
### 3. Unified Memory Advantage
The GPT-OSS 120B weights take up roughly 62GB in MXFP4.
On an RTX 5090 (32GB): You simply cannot run this model locally without extreme 2-bit quantization or splitting it across multiple cards.
On the DGX Spark: It fits with 66GB of RAM left over. This "extra" memory is used by the system to hold a massive KV Cache, which is critical for the model's 128k context window. You can feed it a 100-page PDF, and it will "remember" the first page just as well as the last without slowing down.
### Summary: Why run GPT-OSS 120B on a Spark?
It is the only setup that gives you GPT-4 class reasoning in a desktop box that stays quiet and doesn't require a 1500W power circuit. While a dual-5090 setup might be slightly faster, the Spark's Unified Memory makes it more stable for the long-context agentic tasks this model was built for.
Would you like me to show you how to configure the "Reasoning Effort" levels in your local vLLM instance for this model?
From <https://gemini.google.com/app/> Google Gemini 3.0 (Fast)