#3179 Ask Chat: How Can I Get Consistent, Deterministic Results from my LLM?
#3179 How can I realistically get consistent, deterministic results from my LLM?
“Deterministic” means every run of the same prompt returns the identical token sequence. With today’s LLM stacks you can get very close, but there are three different randomness layers you have to lock down:
Layer | Where the randomness creeps in | What you can realistically do |
---|---|---|
Decoder / sampling | Temperature, top-p, top-k, nucleus/beam sampling | Force a greedy decode: `temperature=0`, `top_p=1`, `top_k=1` (or simply omit the samplers entirely). This removes algorithmic randomness. |
Seeder / RNG state | Each call seeds its random-number generator | Pass an explicit, fixed seed. OpenAI exposes a `seed` field in Chat Completions (beta) and returns a `system_fingerprint` so you can verify the backend hasn’t changed. |
Implementation / hardware | Non-deterministic kernels, multithreading, model updates | Cloud APIs: pin an immutable model ID (e.g. `gpt-4-0125-preview`) and cache the `system_fingerprint`; updates to the model break determinism by design, so cache responses you must reuse. Local inference: use the exact same quantised `.gguf`/`.bin` file, tokenizer build, and inference-engine version; fix the seed; run single-threaded (or CPU-only) to avoid thread-ordering jitter; turn on deterministic CUDA flags or stay on CPU, since GPU kernels can diverge by one floating-point bit. |
Putting it into practice

- Freeze the model

```bash
# Ollama
ollama pull llama3:8b-q4_0        # keep the file and record its SHA256
```

- Call with a locked sampler and seed. For local Ollama the sampler settings go in the API’s `options` object (or `PARAMETER` lines in a Modelfile):

```bash
# Ollama
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b-q4_0",
  "prompt": "Explain Newton’s third law in one sentence.",
  "stream": false,
  "options": {"temperature": 0, "top_k": 1, "top_p": 1, "seed": 42, "num_thread": 1}
}'
```

```python
# OpenAI
client.chat.completions.create(
    model="gpt-4o-2025-04-09",
    messages=[...],
    temperature=0,
    top_p=1,
    seed=42,   # beta
)
```

- Validate fingerprints or hashes. Cloud: compare `system_fingerprint` on every call. Local: compare the model file hash at startup (see the sketch after this list).

- Cache and version-control the outputs for anything mission-critical. Even with every knob pinned, vendors inevitably refresh weights, and a patched kernel in your GPU driver can flip a bit.
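A minimal sketch of the validation step, assuming you recorded the expected hash and fingerprint yourself at freeze time (the helper names here are illustrative, not part of any SDK):

```python
import hashlib

def sha256_of_file(path: str) -> str:
    """Stream the file so multi-gigabyte model weights never sit fully in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def check_local_model(model_path: str, expected_sha256: str) -> None:
    """Local check: the weights on disk must match the hash recorded at freeze time."""
    actual = sha256_of_file(model_path)
    if actual != expected_sha256:
        raise RuntimeError(f"Model file changed: {actual} != {expected_sha256}")

def check_cloud_fingerprint(response, expected_fingerprint: str) -> None:
    """Cloud check: the system_fingerprint on each completion should stay the one you pinned."""
    if response.system_fingerprint != expected_fingerprint:
        raise RuntimeError(
            f"Backend changed: {response.system_fingerprint} != {expected_fingerprint}"
        )
```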
Why “perfect” determinism can still slip
- Floating-point math is not associative. Tiny rounding differences from driver or kernel changes propagate through thousands of layers.
- Speculative decoding and KV-cache tricks vary by provider and occasionally change outputs even under an identical seed.
- Server-side safety rewrites or post-filters can tweak a word or punctuation mark after decoding.
In short: lock the sampler, lock the seed, lock the binary, single-thread if you can, and treat the first run as the ground-truth answer you cache. That is as deterministic as LLMs get today—and good enough for testing suites, unit-style goldens, or audit-grade pipelines.
ChatGPT said:
Exactly—for that specific prompt, with that exact model build and decoding config, you can treat the first run as canonical and just keep the text in a cache or fixture file. But in practice you’ll still find yourself re-hitting the model for at least three common reasons:
Why you might re-run | What triggers it | How to manage it |
---|---|---|
Prompt or input drift | Even a single-character edit (e.g., changing a date, adding a variable) means the cached answer no longer applies. | Use a hash of the full prompt + system/user messages + decode settings as your cache key. Misses fall through to the model and then enter the cache. |
Model or policy updates | Vendors update weights, safety filters, or even tokenizer versions; local models get re-quantised. | Store metadata with each cached answer: `{model_id, model_sha256, system_fingerprint, seed, temperature, top_p, …}`. Invalidate (or version) the entry if any of those change (see the sketch below the table). |
Regression / golden-test suites | You want to know if a model change should be accepted or flagged. | Re-run tests against the new build and diff against cached goldens; promote new outputs only after review. |
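A minimal sketch of the metadata idea from the middle row, assuming the flat JSON-file cache used elsewhere in this post (the field names mirror the table; nothing here is a vendor API):

```python
import json, pathlib

CACHE_DIR = pathlib.Path("./llm_cache")
TRACKED_FIELDS = ("model_id", "model_sha256", "system_fingerprint",
                  "seed", "temperature", "top_p")

def load_if_still_valid(key: str, current_meta: dict):
    """Return the cached answer only if every tracked knob still matches; else signal a miss."""
    entry_path = CACHE_DIR / f"{key}.json"
    if not entry_path.exists():
        return None
    entry = json.loads(entry_path.read_text())
    for field in TRACKED_FIELDS:
        if entry.get("meta", {}).get(field) != current_meta.get(field):
            return None          # stale entry: model, fingerprint, or sampler drifted
    return entry["content"]
```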
One-and-done in real pipelines

- Unit tests & CI – Cache the goldens; re-run only when you deliberately bump `model_id` or alter system prompts (a pytest-style sketch follows this list).
- High-throughput batched inference – Pre-compute and store deterministic responses for the top N frequent queries (e.g., FAQ answers) to cut latency and cost.
- Interactive tools / IDE copilots – Caching still helps for “undo/redo” or when a user re-opens a file, but fresh prompts get fresh completions.
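For the unit-test bullet, a pytest-style golden check might look like the sketch below (the directory layout and helper name are illustrative, not a standard fixture):

```python
import json, pathlib

GOLDEN_DIR = pathlib.Path("tests/goldens")

def assert_matches_golden(name: str, fresh_output: str, update: bool = False) -> None:
    """Compare a deterministic completion against its stored golden text.

    The first run (or update=True after a deliberate model bump) writes the golden;
    later runs fail loudly if the output drifts so a human can review the change.
    """
    golden_path = GOLDEN_DIR / f"{name}.json"
    if update or not golden_path.exists():
        GOLDEN_DIR.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(json.dumps({"content": fresh_output}, indent=2))
        return
    golden = json.loads(golden_path.read_text())["content"]
    assert fresh_output == golden, f"Golden mismatch for {name!r}; review before promoting."
```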
Minimal deterministic-cache sketch (Python)

```python
import hashlib, json, pathlib
from openai import OpenAI

CACHE_DIR = pathlib.Path("./llm_cache")
CACHE_DIR.mkdir(exist_ok=True)

def key(prompt, cfg):
    blob = json.dumps({"prompt": prompt, **cfg}, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def cached_completion(prompt, **cfg):
    k = key(prompt, cfg)
    fp = CACHE_DIR / f"{k}.json"
    if fp.exists():                       # ← hit
        return json.loads(fp.read_text())["content"]
    # miss: call the model
    client = OpenAI()
    rsp = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        **cfg
    )
    out = rsp.choices[0].message.content
    fp.write_text(json.dumps({
        "prompt": prompt,
        "cfg": cfg,
        "content": out,
        "model_id": rsp.model,
        "system_fingerprint": rsp.system_fingerprint,
    }))
    return out
```
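A quick usage sketch of the cache above; with identical arguments the second call never touches the API (the model id is the pinned example from earlier, so substitute your own):

```python
# First call: API round-trip, writes ./llm_cache/<sha256>.json.
# Second call with identical arguments: pure cache hit, byte-for-byte the same text.
answer = cached_completion(
    "Explain Newton’s third law in one sentence.",
    model="gpt-4o-2025-04-09",
    temperature=0,
    top_p=1,
    seed=42,
)
print(answer)
```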
Take-aways

- “Yes, one-and-done for a frozen prompt+model.”
- Still plan for cache invalidation—same as you would for any compiled artifact or data snapshot.
- Embed version info in the cache so you can tell when it’s time to regenerate.
If you’d like pointers on specific caching layers (SQLite, Redis, local files) or integration with Ollama-CLI, let me know—happy to dive deeper.
You said:
ChatGPT said:
Below is a drop-in Python template that guarantees the shape of every response stays identical (valid JSON that matches a schema) while the values change with each user request.
python"""
deterministic_structured_llm.py
--------------------------------
Query an OpenAI (or Azure OpenAI) model and always get back a JSON
object of the form:
{
"topic": str,
"summary": str,
"key_points": [str], # ≥ 3
"next_actions": [str] # ≥ 2
}
The schema is enforced three ways:
1. response_format={"type":"json_object"} # JSON-mode ⇢ model must emit JSON
2. temperature=0, top_p=1, seed=<int> # deterministic decode
3. Local schema validation + automatic retry
"""
import os, json, time, hashlib
import openai
from jsonschema import validate, ValidationError
# ----------------------------------------------------------------------
# 1. Configuration
# ----------------------------------------------------------------------
MODEL_ID = "gpt-4o-2025-04-09" # pin an immutable build
DECODE_CONFIG = { # deterministic sampler
"temperature": 0,
"top_p": 1,
"seed": 42, # reproducible-output beta :contentReference[oaicite:0]{index=0}
}
JSON_SCHEMA = {
"type": "object",
"required": ["topic", "summary", "key_points", "next_actions"],
"properties": {
"topic": {"type": "string"},
"summary": {"type": "string"},
"key_points": {
"type": "array",
"minItems": 3,
"items": {"type": "string"}
},
"next_actions": {
"type": "array",
"minItems": 2,
"items": {"type": "string"}
}
},
}
SYSTEM_PROMPT = (
"You are a concise assistant that ALWAYS returns a JSON object matching the "
"provided schema and NOTHING else."
)
# ----------------------------------------------------------------------
# 2. Helper: cache responses so unchanged prompts never re-hit the model
# ----------------------------------------------------------------------
CACHE_DIR = ".cache"
os.makedirs(CACHE_DIR, exist_ok=True)
def cache_key(prompt: str) -> str:
blob = json.dumps({"prompt": prompt, "model": MODEL_ID, **DECODE_CONFIG}, sort_keys=True)
return hashlib.sha256(blob.encode()).hexdigest()
def load_from_cache(key: str):
path = os.path.join(CACHE_DIR, f"{key}.json")
return json.load(open(path)) if os.path.exists(path) else None
def save_to_cache(key: str, obj: dict):
path = os.path.join(CACHE_DIR, f"{key}.json")
with open(path, "w") as f: json.dump(obj, f, indent=2)
# ----------------------------------------------------------------------
# 3. Main entry point
# ----------------------------------------------------------------------
def generate_structured_answer(topic: str) -> dict:
user_prompt = f"Topic: {topic}\nProvide a short summary, ≥3 key points, and ≥2 next actions."
key = cache_key(user_prompt)
if cached := load_from_cache(key):
return cached
client = openai.OpenAI()
for attempt in range(3): # retry loop for rare JSON slips
rsp = client.chat.completions.create(
model=MODEL_ID,
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": user_prompt},
],
response_format={"type": "json_object"}, # JSON-mode :contentReference[oaicite:1]{index=1}
**DECODE_CONFIG,
)
content = rsp.choices[0].message.content
try:
obj = json.loads(content)
validate(instance=obj, schema=JSON_SCHEMA) # schema guard
save_to_cache(key, obj)
return obj
except (json.JSONDecodeError, ValidationError):
time.sleep(0.5) # very rare; try again with same seed
raise RuntimeError("Model failed to return valid JSON after 3 attempts.")
# ----------------------------------------------------------------------
# 4. CLI quick-test
# ----------------------------------------------------------------------
if __name__ == "__main__":
import sys, pprint
topic = sys.argv[1] if len(sys.argv) > 1 else "bicycle maintenance"
pprint.pp(generate_structured_answer(topic))
Why this stays stable

Guard-rail | What it does | Doc |
---|---|---|
`response_format={"type":"json_object"}` | Forces the model into JSON mode so it must emit a single JSON object. | OpenAI Platform |
`temperature=0`, `top_p=1`, `seed=42` | Removes sampling randomness and pins a reproducible seed. | OpenAI Cookbook |
JSON-schema validation + retry | Catches the (rare) case where the reply is syntactically wrong and re-prompts automatically. | — |
Adapting it to your needs

- Change the schema in `JSON_SCHEMA` to whatever fixed structure you require.
- Rewrite the `user_prompt` string so the model knows how many elements or what kind of data to fill in.
- Swap models (e.g., your local Ollama endpoint): keep the zero-temperature / fixed-seed idea and pass `--json` (llama.cpp), or post-process with a validator the same way.
- Integrate with your pipeline by importing `generate_structured_answer` and feeding it the user’s dynamic input.
With those pieces in place, you get a single script that:

- Always returns the exact same JSON skeleton → easy downstream parsing.
- Fills it with fresh content driven by the caller’s topic or parameters.
- Skips the API call altogether when the prompt hasn’t changed (cache hit).

That gives you predictable structure and flexible, user-specific details—ideal for tests, pipelines, or any service that must never break its contract.
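If the “swap models” bullet means a local Ollama server, one option is to keep this exact template and only change the client construction: Ollama exposes an OpenAI-compatible endpoint (assumed here at its default port; how fully a given version honours `response_format` and `seed` can vary, so treat this as a sketch):

```python
from openai import OpenAI

# Point the same SDK at the local server; the api_key is required by the client
# but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

rsp = client.chat.completions.create(
    model="llama3.2",                            # a local tag instead of a cloud model id
    messages=[
        {"role": "system",
         "content": "Return ONLY a JSON object matching the agreed schema."},
        {"role": "user",
         "content": "Topic: bicycle maintenance. Short summary, ≥3 key points, ≥2 next actions."},
    ],
    response_format={"type": "json_object"},     # JSON mode, as in the cloud version
    temperature=0,
    top_p=1,
    seed=42,
)
print(rsp.choices[0].message.content)
```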
1 · Key Ollama switches that match the template

Goal | Ollama option | Where to set it | Notes |
---|---|---|---|
Force JSON-only output | `format: "json"` (or a full JSON schema for the newer structured-output support) | `format` field of the HTTP body / `format=` argument in the Python client | JSON mode is engine-level, so it works with every model. |
Deterministic decode | `temperature: 0`, `top_p: 1`, `seed: 42` | `options` dict in the API call, or `PARAMETER` lines in a Modelfile | Any fixed seed ≥ 0 is accepted. |
Single-thread / same math | `num_thread: 1` | same `options` dict | Optional, but removes small FP drift on some setups. |
Pin model build | Pull a specific tag (e.g. `ollama pull llama3.2:8b-q4_0`) and record the SHA256 | Shell / deployment scripts | If you rebuild or quantise differently, the seed no longer suffices. |
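The same switches expressed as the raw HTTP body the table mentions, here sent with `requests` to Ollama’s default local port (host, model tag, and prompt are placeholders):

```python
import requests

body = {
    "model": "llama3.2",
    "prompt": 'Return a JSON object with "topic" and "summary" about battery recycling.',
    "format": "json",                       # engine-level JSON mode
    "stream": False,
    "options": {"temperature": 0, "top_p": 1, "seed": 42, "num_thread": 1},
}
resp = requests.post("http://localhost:11434/api/generate", json=body, timeout=120)
print(resp.json()["response"])              # the model's JSON string
```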
2 · Minimal Python example (works with all four models)

```python
import json, os, time, hashlib
import ollama
from jsonschema import validate, ValidationError

MODEL_NAME = "llama3.2"               # swap to "granite3.2", "gemma3", "phi-4"
DECODE_OPTS = {"temperature": 0, "top_p": 1, "seed": 42}   # deterministic sampler options

JSON_SCHEMA = {
    "type": "object",
    "required": ["topic", "summary", "key_points", "next_actions"],
    "properties": {
        "topic": {"type": "string"},
        "summary": {"type": "string"},
        "key_points": {"type": "array", "minItems": 3, "items": {"type": "string"}},
        "next_actions": {"type": "array", "minItems": 2, "items": {"type": "string"}}
    }
}

def cache_key(prompt: str) -> str:
    blob = json.dumps({"prompt": prompt, "model": MODEL_NAME, **DECODE_OPTS}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def get_answer(topic: str) -> dict:
    prompt = (f'You are a JSON-only assistant.\nTopic: "{topic}". '
              'Return a JSON object matching the schema.')
    key = cache_key(prompt)
    path = f".cache/{key}.json"
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)

    client = ollama.Client()
    for _ in range(3):                 # retry loop, seed fixed
        resp = client.generate(
            model=MODEL_NAME,
            prompt=prompt,
            format=JSON_SCHEMA,        # structured output: accepts "json" or a JSON schema
            options=DECODE_OPTS,       # sampler options live here, not the format
            stream=False,
        )
        try:
            obj = json.loads(resp["response"])
            validate(obj, JSON_SCHEMA)
            os.makedirs(".cache", exist_ok=True)
            with open(path, "w") as f:
                json.dump(obj, f)
            return obj
        except (json.JSONDecodeError, ValidationError):
            time.sleep(0.5)
    raise RuntimeError("Model failed to satisfy schema three times.")

if __name__ == "__main__":
    print(get_answer("battery recycling"))
```
Swap `MODEL_NAME` for `granite3.2`, `gemma3`, or `phi-4`; the rest is identical.
3 · Model-specific observations

Model | JSON obedience @ T=0 | Typical caveats |
---|---|---|
Llama-3 (8 B / 70 B) | Very good | Large context helps; keep prompts concise for the 8 B model. |
IBM Granite-3.2 | Good, but sometimes verbose | Add: “Do not add explanations outside the JSON.” |
Gemma-3 | Fair | May need one retry in ~5 % of calls; the schema loop handles it. |
Phi-4 | Surprisingly reliable | Small model → higher risk of short summaries; ask for longer summaries in the prompt (or raise the token limit) if needed. |
All four respect `seed`, `temperature`, and `format`, so once a prompt passes validation you can cache the output exactly as you planned.
4 · When determinism can still slip
- Different quantisation of the same model tag.
- Upgrading the Ollama engine—kernel tweaks can change rounding behaviour.
- Running multi-threaded or on different GPUs without deterministic math flags.

If the output truly has to be invariant across machines, ship the model file hash plus the cached JSON alongside your application and fall back to the cache by default (a small sketch follows).
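A small sketch of that shipping/fallback idea, reading “fall back by default” as: only re-run the model when the local weights provably match the pinned hash, otherwise serve the answer you shipped (function and argument names are illustrative):

```python
import hashlib, json, pathlib

def answer_or_shipped_cache(model_path: str, expected_sha256: str,
                            shipped_json_path: str, regenerate):
    """Re-run the deterministic pipeline only on a verified environment; else use the shipped answer."""
    digest = hashlib.sha256()
    with open(model_path, "rb") as f:                 # stream: model files are large
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() == expected_sha256:
        return regenerate()                           # same weights → same output
    return json.loads(pathlib.Path(shipped_json_path).read_text())
```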
Bottom line
Yes—just point the script at any Ollama model, set `format: "json"`, fix the seed, and keep a schema-validation retry loop. That gives you identical structure every call while still letting each model fill in user-specific details.
From <https://chatgpt.com/> ChatGPT o3