#3179 Ask Chat: How Can I Get Consistent, Deterministic Results from my LLM?
#3179 How can I realistically get consistent, deterministic results from my LLM?
“Deterministic” means every run of the same prompt returns the identical token sequence. With today’s LLM stacks you can get very close, but there are three different randomness layers you have to lock down:
Layer | Where the randomness creeps in | What you can realistically do |
---|---|---|
Decoder / sampling | Temperature, top-p, top-k, nucleus/beam sampling | Force a greedy decode: `temperature=0`, `top_p=1`, `top_k=1` (or simply omit the samplers entirely). This removes algorithmic randomness. |
Seeder / RNG state | Each call seeds its random-number generator | Pass an explicit, fixed seed. OpenAI exposes a `seed` field in Chat Completions (beta) and returns a `system_fingerprint` so you can verify the backend hasn’t changed. |
Implementation / hardware | Non-deterministic kernels, multithreading, model updates | Cloud APIs: pin an immutable model ID (e.g. `gpt-4-0125-preview`) and cache the `system_fingerprint`; updates to the model break determinism by design, so cache responses you must reuse. Local inference: use the exact same quantised `.gguf`/`.bin` file, tokenizer build, and inference-engine version; fix the seed; run single-threaded (or CPU-only) to avoid thread-ordering jitter; turn on deterministic CUDA flags or stay on CPU, since GPU kernels can diverge by one floating-point bit. |
Putting it into practice

- Freeze the model

```bash
# Ollama
ollama pull llama3:8b-q4_0        # keep the file and record its SHA256
```

- Call with a locked sampler and seed. For local Ollama the sampler settings go in the API’s `options` object (or `PARAMETER` lines in a Modelfile):

```bash
# Ollama
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b-q4_0",
  "prompt": "Explain Newton’s third law in one sentence.",
  "stream": false,
  "options": {"temperature": 0, "top_k": 1, "top_p": 1, "seed": 42, "num_thread": 1}
}'
```

```python
# OpenAI
client.chat.completions.create(
    model="gpt-4o-2025-04-09",
    messages=[...],
    temperature=0,
    top_p=1,
    seed=42,   # beta
)
```

- Validate fingerprints or hashes. Cloud: compare `system_fingerprint` on every call. Local: compare the model file hash at startup (see the sketch after this list).

- Cache and version-control the outputs for anything mission-critical. Even with every knob pinned, vendors inevitably refresh weights, and a patched kernel in your GPU driver can flip a bit.
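A minimal sketch of the validation step, assuming you recorded the expected hash and fingerprint yourself at freeze time (the helper names here are illustrative, not part of any SDK):

```python
import hashlib

def sha256_of_file(path: str) -> str:
    """Stream the file so multi-gigabyte model weights never sit fully in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def check_local_model(model_path: str, expected_sha256: str) -> None:
    """Local check: the weights on disk must match the hash recorded at freeze time."""
    actual = sha256_of_file(model_path)
    if actual != expected_sha256:
        raise RuntimeError(f"Model file changed: {actual} != {expected_sha256}")

def check_cloud_fingerprint(response, expected_fingerprint: str) -> None:
    """Cloud check: the system_fingerprint on each completion should stay the one you pinned."""
    if response.system_fingerprint != expected_fingerprint:
        raise RuntimeError(
            f"Backend changed: {response.system_fingerprint} != {expected_fingerprint}"
        )
```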
Why “perfect” determinism can still slip
- Floating-point math is not associative. Tiny rounding differences from driver or kernel changes propagate through thousands of layers.
- Speculative decoding and KV-cache tricks vary by provider and occasionally change outputs even under an identical seed.
- Server-side safety rewrites or post-filters can tweak a word or punctuation mark after decoding.
In short: lock the sampler, lock the seed, lock the binary, single-thread if you can, and treat the first run as the ground-truth answer you cache. That is as deterministic as LLMs get today—and good enough for testing suites, unit-style goldens, or audit-grade pipelines.
ChatGPT said:
Exactly—for that specific prompt, with that exact model build and decoding config, you can treat the first run as canonical and just keep the text in a cache or fixture file. But in practice you’ll still find yourself re-hitting the model for at least three common reasons:
Why you might re-run | What triggers it | How to manage it |
---|---|---|
Prompt or input drift | Even a single-character edit (e.g., changing a date, adding a variable) means the cached answer no longer applies. | Use a hash of the full prompt + system/user messages + decode settings as your cache key. Misses fall through to the model and then enter the cache. |
Model or policy updates | Vendors update weights, safety filters, or even tokenizer versions; local models get re-quantised. | Store metadata with each cached answer: `{model_id, model_sha256, system_fingerprint, seed, temperature, top_p, …}`. Invalidate (or version) the entry if any of those change (see the sketch below the table). |
Regression / golden-test suites | You want to know if a model change should be accepted or flagged. | Re-run tests against the new build and diff against cached goldens; promote new outputs only after review. |
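A minimal sketch of the metadata idea from the middle row, assuming the flat JSON-file cache used elsewhere in this post (the field names mirror the table; nothing here is a vendor API):

```python
import json, pathlib

CACHE_DIR = pathlib.Path("./llm_cache")
TRACKED_FIELDS = ("model_id", "model_sha256", "system_fingerprint",
                  "seed", "temperature", "top_p")

def load_if_still_valid(key: str, current_meta: dict):
    """Return the cached answer only if every tracked knob still matches; else signal a miss."""
    entry_path = CACHE_DIR / f"{key}.json"
    if not entry_path.exists():
        return None
    entry = json.loads(entry_path.read_text())
    for field in TRACKED_FIELDS:
        if entry.get("meta", {}).get(field) != current_meta.get(field):
            return None          # stale entry: model, fingerprint, or sampler drifted
    return entry["content"]
```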
One-and-done in real pipelines

- Unit tests & CI – Cache the goldens; re-run only when you deliberately bump `model_id` or alter system prompts (a pytest-style sketch follows this list).
- High-throughput batched inference – Pre-compute and store deterministic responses for the top N frequent queries (e.g., FAQ answers) to cut latency and cost.
- Interactive tools / IDE copilots – Caching still helps for “undo/redo” or when a user re-opens a file, but fresh prompts get fresh completions.
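For the unit-test bullet, a pytest-style golden check might look like the sketch below (the directory layout and helper name are illustrative, not a standard fixture):

```python
import json, pathlib

GOLDEN_DIR = pathlib.Path("tests/goldens")

def assert_matches_golden(name: str, fresh_output: str, update: bool = False) -> None:
    """Compare a deterministic completion against its stored golden text.

    The first run (or update=True after a deliberate model bump) writes the golden;
    later runs fail loudly if the output drifts so a human can review the change.
    """
    golden_path = GOLDEN_DIR / f"{name}.json"
    if update or not golden_path.exists():
        GOLDEN_DIR.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(json.dumps({"content": fresh_output}, indent=2))
        return
    golden = json.loads(golden_path.read_text())["content"]
    assert fresh_output == golden, f"Golden mismatch for {name!r}; review before promoting."
```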
Minimal deterministic-cache sketch (Python)

```python
import hashlib, json, pathlib
from openai import OpenAI

CACHE_DIR = pathlib.Path("./llm_cache")
CACHE_DIR.mkdir(exist_ok=True)

def key(prompt, cfg):
    blob = json.dumps({"prompt": prompt, **cfg}, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def cached_completion(prompt, **cfg):
    k = key(prompt, cfg)
    fp = CACHE_DIR / f"{k}.json"
    if fp.exists():                       # ← hit
        return json.loads(fp.read_text())["content"]
    # miss: call the model
    client = OpenAI()
    rsp = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        **cfg
    )
    out = rsp.choices[0].message.content
    fp.write_text(json.dumps({
        "prompt": prompt,
        "cfg": cfg,
        "content": out,
        "model_id": rsp.model,
        "system_fingerprint": rsp.system_fingerprint,
    }))
    return out
```
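A quick usage sketch of the cache above; with identical arguments the second call never touches the API (the model id is the pinned example from earlier, so substitute your own):

```python
# First call: API round-trip, writes ./llm_cache/<sha256>.json.
# Second call with identical arguments: pure cache hit, byte-for-byte the same text.
answer = cached_completion(
    "Explain Newton’s third law in one sentence.",
    model="gpt-4o-2025-04-09",
    temperature=0,
    top_p=1,
    seed=42,
)
print(answer)
```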
Take-aways

- “Yes, one-and-done for a frozen prompt+model.”
- Still plan for cache invalidation—same as you would for any compiled artifact or data snapshot.
- Embed version info in the cache so you can tell when it’s time to regenerate.
If you’d like pointers on specific caching layers (SQLite, Redis, local files) or integration with Ollama-CLI, let me know—happy to dive deeper.
You said:
ChatGPT said:
Below is a drop-in Python template that guarantees the shape of every response stays identical (valid JSON that matches a schema) while the values change with each user request.
python"""
deterministic_structured_llm.py
--------------------------------
Query an OpenAI (or Azure OpenAI) model and always get back a JSON
object of the form:
{
"topic": str,
"summary": str,
"key_points": [str], # ≥ 3
"next_actions": [str] # ≥ 2
}
The schema is enforced three ways:
1. response_format={"type":"json_object"} # JSON-mode ⇢ model must emit JSON
2. temperature=0, top_p=1, seed=<int> # deterministic decode
3. Local schema validation + automatic retry
"""
import os, json, time, hashlib
import openai
from jsonschema import validate, ValidationError
# ----------------------------------------------------------------------
# 1. Configuration
# ----------------------------------------------------------------------
MODEL_ID = "gpt-4o-2025-04-09" # pin an immutable build
DECODE_CONFIG = { # deterministic sampler
"temperature": 0,
"top_p": 1,
"seed": 42, # reproducible-output beta :contentReference[oaicite:0]{index=0}
}
JSON_SCHEMA = {
"type": "object",
"required": ["topic", "summary", "key_points", "next_actions"],
"properties": {
"topic": {"type": "string"},
"summary": {"type": "string"},
"key_points": {
"type": "array",
"minItems": 3,
"items": {"type": "string"}
},
"next_actions": {
"type": "array",
"minItems": 2,
"items": {"type": "string"}
}
},
}
SYSTEM_PROMPT = (
"You are a concise assistant that ALWAYS returns a JSON object matching the "
"provided schema and NOTHING else."
)
# ----------------------------------------------------------------------
# 2. Helper: cache responses so unchanged prompts never re-hit the model
# ----------------------------------------------------------------------
CACHE_DIR = ".cache"
os.makedirs(CACHE_DIR, exist_ok=True)
def cache_key(prompt: str) -> str:
blob = json.dumps({"prompt": prompt, "model": MODEL_ID, **DECODE_CONFIG}, sort_keys=True)
return hashlib.sha256(blob.encode()).hexdigest()
def load_from_cache(key: str):
path = os.path.join(CACHE_DIR, f"{key}.json")
return json.load(open(path)) if os.path.exists(path) else None
def save_to_cache(key: str, obj: dict):
path = os.path.join(CACHE_DIR, f"{key}.json")
with open(path, "w") as f: json.dump(obj, f, indent=2)
# ----------------------------------------------------------------------
# 3. Main entry point
# ----------------------------------------------------------------------
def generate_structured_answer(topic: str) -> dict:
user_prompt = f"Topic: {topic}\nProvide a short summary, ≥3 key points, and ≥2 next actions."
key = cache_key(user_prompt)
if cached := load_from_cache(key):
return cached
client = openai.OpenAI()
for attempt in range(3): # retry loop for rare JSON slips
rsp = client.chat.completions.create(
model=MODEL_ID,
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": user_prompt},
],
response_format={"type": "json_object"}, # JSON-mode :contentReference[oaicite:1]{index=1}
**DECODE_CONFIG,
)
content = rsp.choices[0].message.content
try:
obj = json.loads(content)
validate(instance=obj, schema=JSON_SCHEMA) # schema guard
save_to_cache(key, obj)
return obj
except (json.JSONDecodeError, ValidationError):
time.sleep(0.5) # very rare; try again with same seed
raise RuntimeError("Model failed to return valid JSON after 3 attempts.")
# ----------------------------------------------------------------------
# 4. CLI quick-test
# ----------------------------------------------------------------------
if __name__ == "__main__":
import sys, pprint
topic = sys.argv[1] if len(sys.argv) > 1 else "bicycle maintenance"
pprint.pp(generate_structured_answer(topic))
Why this stays stable

Guard-rail | What it does | Doc |
---|---|---|
`response_format={"type":"json_object"}` | Forces the model into JSON mode so it must emit a single JSON object. | OpenAI Platform |
`temperature=0`, `top_p=1`, `seed=42` | Removes sampling randomness and pins a reproducible seed. | OpenAI Cookbook |
JSON-schema validation + retry | Catches the (rare) case where the reply is syntactically wrong and re-prompts automatically. | — |
Adapting it to your needs

- Change the schema in `JSON_SCHEMA` to whatever fixed structure you require.
- Rewrite the `user_prompt` string so the model knows how many elements or what kind of data to fill in.
- Swap models (e.g., your local Ollama endpoint): keep the zero-temperature / fixed-seed idea and pass `--json` (llama.cpp), or post-process with a validator the same way.
- Integrate with your pipeline by importing `generate_structured_answer` and feeding it the user’s dynamic input.
With those pieces in place, you get a single script that:

- Always returns the exact same JSON skeleton → easy downstream parsing.
- Fills it with fresh content driven by the caller’s topic or parameters.
- Skips the API call altogether when the prompt hasn’t changed (cache hit).

That gives you predictable structure and flexible, user-specific details—ideal for tests, pipelines, or any service that must never break its contract.
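If the “swap models” bullet means a local Ollama server, one option is to keep this exact template and only change the client construction: Ollama exposes an OpenAI-compatible endpoint (assumed here at its default port; how fully a given version honours `response_format` and `seed` can vary, so treat this as a sketch):

```python
from openai import OpenAI

# Point the same SDK at the local server; the api_key is required by the client
# but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

rsp = client.chat.completions.create(
    model="llama3.2",                            # a local tag instead of a cloud model id
    messages=[
        {"role": "system",
         "content": "Return ONLY a JSON object matching the agreed schema."},
        {"role": "user",
         "content": "Topic: bicycle maintenance. Short summary, ≥3 key points, ≥2 next actions."},
    ],
    response_format={"type": "json_object"},     # JSON mode, as in the cloud version
    temperature=0,
    top_p=1,
    seed=42,
)
print(rsp.choices[0].message.content)
```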
1 · Key Ollama switches that match the template

Goal | Ollama option | Where to set it | Notes |
---|---|---|---|
Force JSON-only output | `format: "json"` (or a full JSON schema for the newer structured-output support) | `format` field of the HTTP body / `format=` argument in the Python client | JSON mode is engine-level, so it works with every model. |
Deterministic decode | `temperature: 0`, `top_p: 1`, `seed: 42` | `options` dict in the API call, or `PARAMETER` lines in a Modelfile | Any fixed seed ≥ 0 is accepted. |
Single-thread / same math | `num_thread: 1` | same `options` dict | Optional, but removes small FP drift on some setups. |
Pin model build | Pull a specific tag (e.g. `ollama pull llama3.2:8b-q4_0`) and record the SHA256 | Shell / deployment scripts | If you rebuild or quantise differently, the seed no longer suffices. |
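The same switches expressed as the raw HTTP body the table mentions, here sent with `requests` to Ollama’s default local port (host, model tag, and prompt are placeholders):

```python
import requests

body = {
    "model": "llama3.2",
    "prompt": 'Return a JSON object with "topic" and "summary" about battery recycling.',
    "format": "json",                       # engine-level JSON mode
    "stream": False,
    "options": {"temperature": 0, "top_p": 1, "seed": 42, "num_thread": 1},
}
resp = requests.post("http://localhost:11434/api/generate", json=body, timeout=120)
print(resp.json()["response"])              # the model's JSON string
```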
2 · Minimal Python example (works with all four models)

```python
import json, os, time, hashlib
import ollama
from jsonschema import validate, ValidationError

MODEL_NAME = "llama3.2"               # swap to "granite3.2", "gemma3", "phi-4"
DECODE_OPTS = {"temperature": 0, "top_p": 1, "seed": 42}   # deterministic sampler options

JSON_SCHEMA = {
    "type": "object",
    "required": ["topic", "summary", "key_points", "next_actions"],
    "properties": {
        "topic": {"type": "string"},
        "summary": {"type": "string"},
        "key_points": {"type": "array", "minItems": 3, "items": {"type": "string"}},
        "next_actions": {"type": "array", "minItems": 2, "items": {"type": "string"}}
    }
}

def cache_key(prompt: str) -> str:
    blob = json.dumps({"prompt": prompt, "model": MODEL_NAME, **DECODE_OPTS}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def get_answer(topic: str) -> dict:
    prompt = (f'You are a JSON-only assistant.\nTopic: "{topic}". '
              'Return a JSON object matching the schema.')
    key = cache_key(prompt)
    path = f".cache/{key}.json"
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)

    client = ollama.Client()
    for _ in range(3):                 # retry loop, seed fixed
        resp = client.generate(
            model=MODEL_NAME,
            prompt=prompt,
            format=JSON_SCHEMA,        # structured output: accepts "json" or a JSON schema
            options=DECODE_OPTS,       # sampler options live here, not the format
            stream=False,
        )
        try:
            obj = json.loads(resp["response"])
            validate(obj, JSON_SCHEMA)
            os.makedirs(".cache", exist_ok=True)
            with open(path, "w") as f:
                json.dump(obj, f)
            return obj
        except (json.JSONDecodeError, ValidationError):
            time.sleep(0.5)
    raise RuntimeError("Model failed to satisfy schema three times.")

if __name__ == "__main__":
    print(get_answer("battery recycling"))
```
Swap `MODEL_NAME` for `granite3.2`, `gemma3`, or `phi-4`; the rest is identical.
3 · Model-specific observations

Model | JSON obedience @ T=0 | Typical caveats |
---|---|---|
Llama-3 (8 B / 70 B) | Very good | Large context helps; keep prompts concise for the 8 B model. |
IBM Granite-3.2 | Good, but sometimes verbose | Add: “Do not add explanations outside the JSON.” |
Gemma-3 | Fair | May need one retry in ~5 % of calls; the schema loop handles it. |
Phi-4 | Surprisingly reliable | Small model → higher risk of short summaries; ask for longer summaries in the prompt (or raise the token limit) if needed. |
All four respect `seed`, `temperature`, and `format`, so once a prompt passes validation you can cache the output exactly as you planned.
4 · When determinism can still slip
- Different quantisation of the same model tag.
- Upgrading the Ollama engine—kernel tweaks can change rounding behaviour.
- Running multi-threaded or on different GPUs without deterministic math flags.

If the output truly has to be invariant across machines, ship the model file hash plus the cached JSON alongside your application and fall back to the cache by default (a small sketch follows).
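A small sketch of that shipping/fallback idea, reading “fall back by default” as: only re-run the model when the local weights provably match the pinned hash, otherwise serve the answer you shipped (function and argument names are illustrative):

```python
import hashlib, json, pathlib

def answer_or_shipped_cache(model_path: str, expected_sha256: str,
                            shipped_json_path: str, regenerate):
    """Re-run the deterministic pipeline only on a verified environment; else use the shipped answer."""
    digest = hashlib.sha256()
    with open(model_path, "rb") as f:                 # stream: model files are large
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() == expected_sha256:
        return regenerate()                           # same weights → same output
    return json.loads(pathlib.Path(shipped_json_path).read_text())
```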
Bottom line
Yes—just point the script at any Ollama model, set `format: "json"`, fix the seed, and keep a schema-validation retry loop. That gives you identical structure every call while still letting each model fill in user-specific details.
From <https://chatgpt.com/> ChatGPT o3