#4398 Nvidia DGX Spart Tokens/Second for GPT-OSS-20b, Granite 4 30B, Nemotron 3 Nano 30B and Gemma 4 31B

#4398 Describe the tokens/second inferencing on the Nvidia DGX Spark for GPT-OSS-20B, Granite 4 30B, NVIDIA Nemotron-3-Nano-30B-A3B, and Gemma 4 31B

Based on technical specifications for the NVIDIA DGX Spark (featuring the GB10 Grace Blackwell Superchip with 128GB of unified memory) and the specific model architectures you mentioned, here is the estimated tokens-per-second (tps) performance.

These figures assume the use of NVFP4 (4-bit) quantization, which is the native optimization target for the Blackwell architecture.

DGX Spark Inferencing Performance (Estimated)

Model	Parameters (Total/Active)	Estimated Tokens/Sec	Key Optimization
GPT-OSS-20B	20.9B / 3.6B (MoE)	140 – 165	Top-4 expert routing; optimized for edge.
Granite 4.1 30B	30.7B (Dense)	65 – 80	Dense architecture results in higher compute per token.
Nemotron-3-Nano-30B-A3B	30.1B / 3.5B (Hybrid)	180 – 210	Mamba-2/MoE hybrid; extremely efficient KV cache.
Gemma 4 31B	30.7B (Dense)	60 – 75	Interleaved sliding-window attention.

Technical Deep Dive by Model

1. NVIDIA Nemotron-3-Nano-30B-A3B

This is the performance leader on the DGX Spark. Because it utilizes a Hybrid Mamba-2/MoE architecture, it circumvents the traditional Transformer "KV cache bottleneck."

Active Parameters: Only ~3.5B parameters are active per token.
Speed Advantage: On Blackwell silicon, the 4-bit (NVFP4) throughput is exceptionally high, making it ideal for the "Single-Person Think Tank" or agentic workflows where long-context retrieval is required.

2. GPT-OSS-20B (OpenAI)

Released in late 2025 as an open-weight reasoning model, this Mixture-of-Experts (MoE) model is highly optimized for hardware with limited memory bandwidth (like the Spark's 273 GB/s).

Performance: It hits a "sweet spot" on the Spark, providing reasoning capabilities similar to o3-mini while maintaining triple-digit speeds due to its low active parameter count (3.6B).

3. Granite 4.1 30B (IBM)

Unlike the MoE models, Granite is a dense model.

Throughput: Because every one of the 30B parameters fires for every token, the throughput is significantly lower than the MoE counterparts.
Use Case: Its strength on the Spark lies in its 512K context window and enterprise-grade tool-calling accuracy, rather than raw speed.

4. Gemma 4 31B (Google DeepMind)

The latest iteration of Gemma uses a hybrid attention mechanism (sliding-window mixed with global attention).

Performance: Performance is comparable to Granite 4.1. While slightly slower than MoE models, it offers superior performance in multimodal (image/video) reasoning tasks.
Context: It supports a 256K context window, which fits comfortably within the Spark's 128GB unified memory even with large KV caches.

Hardware Note for DGX Spark

The GB10 Superchip in the Spark provides 1 PFLOPS of FP4 performance. To achieve these speeds, you must ensure your inference engine (e.g., llama.cpp or NVIDIA TensorRT-LLM) is compiled with support for the sm_100 or sm_120 architecture to utilize the Blackwell Tensor Cores effectively.

Which of these models would be best for using an LLM to identify 1500 systems and break them down into subsystems, components, parts and functional description and populating a knowledge graph with this data?

For your project of decomposing 1,500 complex systems into a multi-level hierarchy (Subsystems → Components → Parts) and populating a knowledge graph, NVIDIA Nemotron-3-Nano-30B-A3B is the superior choice for your hardware.

While GPT-OSS-20B is faster for simple extraction, your task requires structural integrity and logical consistency across a massive dataset. Here is how the models compare for this specific "Sentinel Archivist" style workload:

Recommendation: NVIDIA Nemotron-3-Nano-30B-A3B

This model is purpose-built for the exact agentic workflow you are describing.

Structural Reasoning: Its hybrid Mamba-2/MoE architecture excels at hierarchical decomposition. It is less likely to "lose the thread" when breaking a system down into four distinct levels (System, Subsystem, Component, Part) compared to standard dense models.
Knowledge Graph Synergy: Nemotron-3-Nano has the highest "Agentic Index" score among the group. It is specifically optimized for structured JSON/Cypher output, which is critical for populating Neo4j or RDF graphs without syntax errors.
Massive Context (1M tokens): Since you are processing 1,500 systems, you can feed entire technical manuals or large batches of requirements into the prompt. The model can maintain cross-references between different systems to identify common parts (e.g., a specific screw used in both a landing gear and a hatch assembly).

Comparison for Hierarchical Decomposition

Feature	Nemotron-3-Nano	Granite 4 30B	GPT-OSS-20B
Primary Strength	Agentic Planning	Technical/Legal RAG	High-Speed Logic
Graph Readiness	Excellent (Native JSON)	Good (Detailed)	Fair (Needs strict prompting)
Hierarchy Depth	Maintains logic at 4+ levels	Tendency to flatten data	High speed, lower depth
Best Use Case	System Decomposition	Safety Standards Audit	Fast Triage/Categorization

Why the others may fall short:

IBM Granite 4 30B: While excellent for auditing NASA standards (like NPR 7150.2D), it is a dense model. At 70–80 tps on your DGX Spark, processing 1,500 systems will take twice as long as Nemotron. It is better suited for the final audit of your graph rather than the initial construction.
GPT-OSS-20B: This is an "efficiency king." It is great for high-volume, low-complexity tasks. However, in complex "thinking" tasks (like inferring a functional description of a part based on its subsystem context), it lacks the specialized "Thinking Mode" and RL-tuned planning found in the newer Nemotron or Qwen3 variants.

Implementation Tip

Since you are using a DGX Spark, leverage the NVFP4 quantization. Nemotron-3-Nano is specifically tuned for this Blackwell-native format, which will give you the 180+ tps needed to tear through those 1,500 systems in a single afternoon while maintaining the reasoning depth required for a high-fidelity knowledge graph.

Calculate how large the knowledge graph might be in megabytes

To calculate the size of your knowledge graph (KG), we have to estimate the total number of "objects" (nodes, relationships, and properties) generated from your 1,500 systems.

Based on a standard 4-level engineering decomposition (System → Subsystem → Component → Part), here is the breakdown of the estimated footprint in megabytes (MB) using Neo4j storage metrics.

1. The Entity Estimate (The "Volume")

Assuming an average complexity for your 1,500 systems:

Systems: 1,500 nodes
Subsystems: ~7,500 nodes (5 per system)
Components: ~37,500 nodes (5 per subsystem)
Parts: ~187,500 nodes (5 per component)
Total Nodes: ~234,000
Total Relationships: ~232,500 (Parent-child hierarchy + functional dependencies)

2. Disk Space Calculation (Neo4j Standards)

Neo4j uses fixed-size records for its store files. For a "Sentinel Archivist" style graph with descriptions, we use the following averages:

Item Type	Count	Bytes per Record	Subtotal (Bytes)
Nodes	234,000	15 B	3,510,000
Relationships	232,500	34 B	7,905,000
Properties	~1,400,000*	41 B	57,400,000
String Data	~460,000**	128 B (chunks)	58,880,000
TOTAL			~127.7 MB

*Assumes ~3 properties per node (Name, ID, Category) and ~3 per relationship (Type, Priority, Flow).

Assumes each node has a 1-sentence functional description fitting in two 128B blocks.

3. Total Estimated Size

Including overhead for indexes (typically 30%) and the label store:

Core Data: 127.7 MB
Indexes/Overhead: 38.3 MB
Final Footprint: ~166 MB

Key Takeaways for Your Hardware

Memory vs. Disk: While the disk footprint is small (~166 MB), Neo4j performs best when the "Global Store" fits entirely in the page cache. On your 64GB Mini PC, this graph will be "lightning fast" because it only consumes about 0.3% of your available RAM.
Scaling: Even if your decomposition gets much more granular (e.g., 10 parts per component), your graph would still likely stay under 500 MB.
Vector Embeddings: If you plan to add GraphRAG (storing vector embeddings for each part's description to allow semantic search), the size will increase. A 1536-dimension embedding per node would add roughly 1.4 GB to the total size.

Summary: Your 1,500-system knowledge graph is extremely efficient. You can comfortably host this locally on your Ryzen Mini PC without any specialized infrastructure.

From <https://gemini.google.com/app/> Google Gemini 3.0 (Fast)

Search This Blog

Learning from ChatGPT