#3179 Ask Chat: How Can I Get Consistent, Deterministic Results from my LLM?


#3179 How can I realistically get consistent, deterministic results from my LLM?

“Deterministic” means every run of the same prompt returns the identical token sequence. With today’s LLM stacks you can get very close, but there are three different randomness layers you have to lock down:

| Layer | Where the randomness creeps in | What you can realistically do |
| --- | --- | --- |
| Decoder / sampling | Temperature, top-p, top-k, nucleus/beam sampling | Force a greedy decode: temperature = 0, top_p = 1, top_k = 1 (or simply omit the samplers entirely). This removes algorithmic randomness. (GitHub) |
| Seeder / RNG state | Each call seeds its random-number generator | Pass an explicit, fixed seed. OpenAI exposes a seed field in ChatCompletion (beta) and returns a system_fingerprint so you can verify the backend hasn’t changed. (OpenAI Cookbook, OpenAI Platform) |
| Implementation / hardware | Non-deterministic kernels, multithreading, model updates | Cloud APIs: pin an immutable model ID (e.g., gpt-4-0125-preview) and cache the system_fingerprint; updates to the model break determinism by design, so cache responses you must reuse. Local inference: use the exact same quantised .gguf/.bin file, tokenizer build, and inference-engine version; fix --seed and set --threads 1 (or run CPU-only) to avoid thread-ordering jitter; turn on deterministic CUDA flags or stay on CPU, since GPU kernels can diverge by one FP bit. (GitHub, Stack Overflow) |
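
The “deterministic CUDA flags plus greedy decode” advice in the local-inference row can be made concrete. Below is a minimal sketch with PyTorch and Hugging Face Transformers; the model ID is only an illustrative placeholder, and exact flag support varies by PyTorch version and hardware.

python
import os, random
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# cuBLAS needs this set before the first CUDA matmul for deterministic results
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

# Pin every RNG the stack touches
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.use_deterministic_algorithms(True)   # raises if a non-deterministic kernel is hit

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"   # placeholder: any pinned local checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

inputs = tok("Explain Newton's third law in one sentence.", return_tensors="pt")
out = model.generate(**inputs, do_sample=False, max_new_tokens=64)   # greedy decode
print(tok.decode(out[0], skip_special_tokens=True))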

Putting it into practice

  1. Freeze the model

    # Ollama
    ollama pull llama3:8b-q4_0 # Keep the file and record the SHA256
  2. Call with a locked sampler and seed

    # Ollama (if your CLI build doesn’t expose sampler flags, set these via the API options or a Modelfile)
    ollama run llama3:8b-q4_0 \
      --prompt "Explain Newton’s third law in one sentence." \
      --temperature 0 --top-k 1 --top-p 1 --seed 42 --threads 1

    # OpenAI
    client.chat.completions.create(
        model="gpt-4o-2025-04-09",
        messages=[...],
        temperature=0,
        top_p=1,
        seed=42,   # beta
    )
  3. Validate fingerprints or hashes (see the sketch after this list)
    Cloud: compare system_fingerprint on every call.
    Local: compare the model file hash at startup.

  4. Cache and version-control the outputs for anything mission-critical. Even with every knob pinned, vendors inevitably refresh weights, and a patched GPU kernel in a driver update can flip a bit.
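
A minimal sketch for step 3, assuming you recorded a SHA-256 and a system_fingerprint when you pinned the model; the paths and expected values below are placeholders.

python
import hashlib

def file_sha256(path: str) -> str:
    """Stream the model file so multi-GB weights never load fully into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def check_local_model(path: str, expected_sha256: str) -> None:
    actual = file_sha256(path)
    if actual != expected_sha256:
        raise RuntimeError(f"Model file drifted: {actual} != {expected_sha256}")

def check_cloud_fingerprint(rsp, expected_fingerprint: str) -> None:
    # rsp is an OpenAI ChatCompletion response; system_fingerprint may be None on some models
    if rsp.system_fingerprint != expected_fingerprint:
        print("WARNING: backend changed; cached goldens may no longer match")

# Example usage (placeholder values recorded at pin time):
# check_local_model("models/llama3-8b-q4_0.gguf", "e3b0c442...")
# check_cloud_fingerprint(rsp, "fp_abc123")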


Why “perfect” determinism can still slip

  • Floating-point math is not associative. Tiny rounding differences from driver or kernel changes propagate through thousands of layers.

  • Speculative decoding & kv-cache tricks vary by provider and occasionally change outputs even under an identical seed.

  • Server-side safety rewrites or post-filters can tweak a word or punctuation mark after decoding.

In short: lock the sampler, lock the seed, lock the binary, single-thread if you can, and treat the first run as the ground-truth answer you cache. That is as deterministic as LLMs get today—and good enough for testing suites, unit-style goldens, or audit-grade pipelines.


So really, if you are "caching" the ground-truth answer, there really is no need to rerun the script on the LLM. Seems like one-and-done.
ChatGPT said:

Exactly: for that specific prompt, with that exact model build and decoding config, you can treat the first run as canonical and just keep the text in a cache or fixture file. But in practice you’ll still find yourself re-hitting the model for at least three common reasons:

| Why you might re-run | What triggers it | How to manage it |
| --- | --- | --- |
| Prompt or input drift | Even a single-character edit (e.g., changing a date, adding a variable) means the cached answer no longer applies. | Use a hash of the full prompt + system/user messages + decode settings as your cache key. Misses fall through to the model and then enter the cache. |
| Model or policy updates | Vendors update weights, safety filters, or even tokenizer versions; local models get re-quantised. | Store metadata with each cached answer: {model_id, model_sha256, system_fingerprint, seed, temperature, top_p, …}. Invalidate (or version) the entry if any of those change. |
| Regression / golden-test suites | You want to know whether a model change should be accepted or flagged. | Re-run tests against the new build and diff against cached goldens; promote new outputs only after review. |

One-and-done in real pipelines

  • Unit tests & CI – Cache the goldens; re-run only when you deliberately bump model_id or alter system prompts (a pytest-style sketch follows the cache example below).

  • High-throughput batched inference – Pre-compute and store deterministic responses for the top N frequent queries (e.g., FAQ answers) to cut latency and cost.

  • Interactive tools / IDE copilot – Cache still helps for “undo/redo” or when a user re-opens a file, but fresh prompts get fresh completions.

Minimal deterministic-cache sketch (Python)

python
import hashlib, json, pathlib
from openai import OpenAI

CACHE_DIR = pathlib.Path("./llm_cache")
CACHE_DIR.mkdir(exist_ok=True)

def key(prompt, cfg):
    blob = json.dumps({"prompt": prompt, **cfg}, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def cached_completion(prompt, **cfg):
    k = key(prompt, cfg)
    fp = CACHE_DIR / f"{k}.json"
    if fp.exists():                                   # ← hit
        return json.loads(fp.read_text())["content"]

    # miss: call the model
    client = OpenAI()
    rsp = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        **cfg,
    )
    out = rsp.choices[0].message.content
    fp.write_text(json.dumps({
        "prompt": prompt,
        "cfg": cfg,
        "content": out,
        "model_id": rsp.model,
        "system_fingerprint": rsp.system_fingerprint,
    }))
    return out
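
And here is the pytest-style golden-test sketch mentioned above. It reuses key() and CACHE_DIR from the cache snippet; the module name llm_cache, the prompt list, and the model ID are assumptions, not fixed names.

python
# test_goldens.py -- regression sketch: re-query the model and diff against the cached goldens
import json
import pytest
from openai import OpenAI

from llm_cache import key, CACHE_DIR   # hypothetical module name for the cache sketch above

PROMPTS = ["Explain Newton's third law in one sentence."]   # illustrative golden prompts
CFG = {"model": "gpt-4o-2025-04-09", "temperature": 0, "top_p": 1, "seed": 42}

@pytest.mark.parametrize("prompt", PROMPTS)
def test_output_matches_golden(prompt):
    golden_file = CACHE_DIR / f"{key(prompt, CFG)}.json"
    if not golden_file.exists():
        pytest.skip("no golden recorded yet; run the pipeline once to create it")
    golden = json.loads(golden_file.read_text())["content"]

    # Deliberately bypass the cache: the point is to see whether a new build still matches.
    rsp = OpenAI().chat.completions.create(
        messages=[{"role": "user", "content": prompt}], **CFG
    )
    fresh = rsp.choices[0].message.content
    assert fresh == golden, "output drifted from the recorded golden; review before promoting"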

Take-aways

  • “Yes, one-and-done for a frozen prompt+model.”

  • Still plan for cache invalidation—same as you would for any compiled artifact or data snapshot.

  • Embed version info in the cache so you can tell when it’s time to regenerate.
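
Building on that last take-away, a minimal staleness check, assuming each cached answer carries the metadata suggested in the table above (the field names are illustrative, not a fixed schema):

python
def is_stale(entry: dict, current: dict) -> bool:
    """Return True if a cached answer was produced under a different model or decode config.

    `entry` is the metadata stored with the cached answer; `current` is the config you
    are about to run with.
    """
    pinned = ("model_id", "model_sha256", "system_fingerprint",
              "seed", "temperature", "top_p")
    return any(entry.get(k) != current.get(k) for k in pinned)

# Usage sketch:
# if is_stale(cached_entry, current_cfg):
#     regenerate, re-validate, and overwrite (or version) the cache entry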

If you’d like pointers on specific caching layers (SQLite, Redis, local files) or integration with Ollama-CLI, let me know—happy to dive deeper.

You said:
I need a script that produces output of the same structure and format every time, but where details are different dependent on the user needs
ChatGPT said:

Below is a drop-in Python template that guarantees the shape of every response stays identical (valid JSON that matches a schema) while the values change with each user request.

python
""" deterministic_structured_llm.py -------------------------------- Query an OpenAI (or Azure OpenAI) model and always get back a JSON object of the form: { "topic": str, "summary": str, "key_points": [str], # ≥ 3 "next_actions": [str] # ≥ 2 } The schema is enforced three ways: 1. response_format={"type":"json_object"} # JSON-mode ⇢ model must emit JSON 2. temperature=0, top_p=1, seed=<int> # deterministic decode 3. Local schema validation + automatic retry """ import os, json, time, hashlib import openai from jsonschema import validate, ValidationError # ---------------------------------------------------------------------- # 1. Configuration # ---------------------------------------------------------------------- MODEL_ID = "gpt-4o-2025-04-09" # pin an immutable build DECODE_CONFIG = { # deterministic sampler "temperature": 0, "top_p": 1, "seed": 42, # reproducible-output beta :contentReference[oaicite:0]{index=0} } JSON_SCHEMA = { "type": "object", "required": ["topic", "summary", "key_points", "next_actions"], "properties": { "topic": {"type": "string"}, "summary": {"type": "string"}, "key_points": { "type": "array", "minItems": 3, "items": {"type": "string"} }, "next_actions": { "type": "array", "minItems": 2, "items": {"type": "string"} } }, } SYSTEM_PROMPT = ( "You are a concise assistant that ALWAYS returns a JSON object matching the " "provided schema and NOTHING else." ) # ---------------------------------------------------------------------- # 2. Helper: cache responses so unchanged prompts never re-hit the model # ---------------------------------------------------------------------- CACHE_DIR = ".cache" os.makedirs(CACHE_DIR, exist_ok=True) def cache_key(prompt: str) -> str: blob = json.dumps({"prompt": prompt, "model": MODEL_ID, **DECODE_CONFIG}, sort_keys=True) return hashlib.sha256(blob.encode()).hexdigest() def load_from_cache(key: str): path = os.path.join(CACHE_DIR, f"{key}.json") return json.load(open(path)) if os.path.exists(path) else None def save_to_cache(key: str, obj: dict): path = os.path.join(CACHE_DIR, f"{key}.json") with open(path, "w") as f: json.dump(obj, f, indent=2) # ---------------------------------------------------------------------- # 3. Main entry point # ---------------------------------------------------------------------- def generate_structured_answer(topic: str) -> dict: user_prompt = f"Topic: {topic}\nProvide a short summary, ≥3 key points, and ≥2 next actions." key = cache_key(user_prompt) if cached := load_from_cache(key): return cached client = openai.OpenAI() for attempt in range(3): # retry loop for rare JSON slips rsp = client.chat.completions.create( model=MODEL_ID, messages=[ {"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": user_prompt}, ], response_format={"type": "json_object"}, # JSON-mode :contentReference[oaicite:1]{index=1} **DECODE_CONFIG, ) content = rsp.choices[0].message.content try: obj = json.loads(content) validate(instance=obj, schema=JSON_SCHEMA) # schema guard save_to_cache(key, obj) return obj except (json.JSONDecodeError, ValidationError): time.sleep(0.5) # very rare; try again with same seed raise RuntimeError("Model failed to return valid JSON after 3 attempts.") # ---------------------------------------------------------------------- # 4. CLI quick-test # ---------------------------------------------------------------------- if __name__ == "__main__": import sys, pprint topic = sys.argv[1] if len(sys.argv) > 1 else "bicycle maintenance" pprint.pp(generate_structured_answer(topic))

Why this stays stable

| Guard-rail | What it does | Doc |
| --- | --- | --- |
| response_format={"type":"json_object"} | Forces the model into JSON-mode so it must emit a single JSON object. | OpenAI Platform |
| temperature = 0, top_p = 1, seed = 42 | Removes sampling randomness and pins a reproducible seed. | OpenAI Cookbook |
| JSON-schema validation + retry | Catches the (rare) case where the reply is syntactically wrong and re-prompts automatically. | |

Adapting it to your needs

  1. Change the schema in JSON_SCHEMA to whatever fixed structure you require.

  2. Rewrite the user_prompt string so the model knows how many elements or what kind of data to fill in.

  3. Swap models (e.g., your local Ollama endpoint): keep the zero-temperature / fixed seed idea and pass --json (llama.cpp), or post-process with a validator the same way.

  4. Integrate with your pipeline by importing generate_structured_answer and feeding it the user’s dynamic input.
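
For example, a minimal caller; the module name matches the filename in the script’s docstring, and everything else comes from the script above.

python
from deterministic_structured_llm import generate_structured_answer

user_topic = input("What topic do you need a briefing on? ")   # dynamic, user-supplied detail
answer = generate_structured_answer(user_topic)

# Same keys every time; only the values depend on the user's topic.
print(answer["summary"])
for action in answer["next_actions"]:
    print("-", action)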

With those pieces in place, you get a single script that:

  • Always returns the exact same JSON skeleton → easy downstream parsing.

  • Fills it with fresh content driven by the caller’s topic or parameters.

  • Skips the API call altogether when the prompt hasn’t changed (cache hit).

That gives you predictable structure and flexible, user-specific details—ideal for tests, pipelines, or any service that must never break its contract.


Will this work with Ollama supported models Llama3.2, Granite3.2, Gemma3 and Phi-4?

Yes: your deterministic-JSON pattern works with any Ollama-hosted model, including Llama 3.2, IBM Granite 3.2, Gemma 3, and Phi-4. You just have to swap the OpenAI client for the Ollama API and use the options Ollama exposes for JSON output and seeding.

1 · Key Ollama switches that match the template

| Goal | Ollama option | Where to set it | Notes |
| --- | --- | --- | --- |
| Force JSON-only output | format: "json" (or pass a JSON schema as format for the newer structured-output helper) | Top-level field in the HTTP body / Python client call | JSON mode is engine-level, so it works with every model. (Ollama, GitHub) |
| Deterministic decode | temperature: 0, top_p: 1, seed: 42 | The options dict, or CLI flags --temperature 0 --top-p 1 --seed 42 | Any fixed seed ≥ 0 is accepted. (Pace 'n Think) |
| Single-thread / same math | num_thread: 1 (option) or OLLAMA_NUM_THREADS=1 env var | | Optional, but removes small FP drift on some GPUs. |
| Pin model build | Pull a specific tag (e.g. ollama pull llama3.2:3b) and record the SHA256 | | If you rebuild or quantise differently, the seed no longer suffices. |
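
To make the “HTTP body” column concrete, here is a sketch of the raw REST call against a locally running Ollama server. The endpoint and option names follow Ollama’s /api/generate interface; the prompt and the num_thread setting are illustrative.

python
import requests

# POST to a local Ollama server (default port 11434)
rsp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Explain Newton's third law in one sentence.",
        "stream": False,
        "format": "json",            # engine-level JSON mode
        "options": {                 # deterministic decode
            "temperature": 0,
            "top_p": 1,
            "seed": 42,
            "num_thread": 1,         # optional: reduces thread-ordering / FP jitter
        },
    },
    timeout=120,
)
rsp.raise_for_status()
print(rsp.json()["response"])        # the model's JSON-formatted answer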

2 · Minimal Python example (works with all four models)

import json, os, time, hashlib
import ollama
from jsonschema import validate, ValidationError

MODEL_NAME = "llama3.2"   # swap to "granite3.2", "gemma3", "phi4"
DECODE_OPTS = {"temperature": 0, "top_p": 1, "seed": 42}

JSON_SCHEMA = {
    "type": "object",
    "required": ["topic", "summary", "key_points", "next_actions"],
    "properties": {
        "topic": {"type": "string"},
        "summary": {"type": "string"},
        "key_points": {"type": "array", "minItems": 3, "items": {"type": "string"}},
        "next_actions": {"type": "array", "minItems": 2, "items": {"type": "string"}},
    },
}

def cache_key(prompt: str) -> str:
    blob = json.dumps({"prompt": prompt, "model": MODEL_NAME, **DECODE_OPTS}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def get_answer(topic: str) -> dict:
    prompt = (
        f'You are a JSON-only assistant.\nTopic: "{topic}". '
        'Return a JSON object matching the schema.'
    )
    key = cache_key(prompt)
    path = f".cache/{key}.json"
    if os.path.exists(path):
        return json.load(open(path))

    client = ollama.Client()
    for _ in range(3):   # retry loop, seed fixed
        resp = client.generate(
            model=MODEL_NAME,
            prompt=prompt,
            format=JSON_SCHEMA,   # structured output (recent Ollama); use format="json" on older builds
            options=DECODE_OPTS,
            stream=False,
        )
        try:
            obj = json.loads(resp["response"])
            validate(obj, JSON_SCHEMA)
            os.makedirs(".cache", exist_ok=True)
            json.dump(obj, open(path, "w"))
            return obj
        except (json.JSONDecodeError, ValidationError):
            time.sleep(0.5)
    raise RuntimeError("Model failed to satisfy schema three times.")

if __name__ == "__main__":
    print(get_answer("battery recycling"))

Swap MODEL_NAME for "granite3.2", "gemma3", or "phi4" (the Ollama tag for Phi-4); the rest is identical.


3 · Model-specific observations

| Model | JSON obedience @ T=0 | Typical caveats |
| --- | --- | --- |
| Llama 3.2 | Very good | A large context helps; keep prompts concise on the smaller variants. |
| IBM Granite 3.2 | Good, but sometimes verbose | Add: “Do not add explanations outside the JSON.” |
| Gemma 3 | Fair | May need one retry in ~5 % of calls; the schema-validation loop handles it. |
| Phi-4 | Surprisingly reliable | Small model → higher risk of short summaries; ask for a minimum summary length in the prompt if needed. |

All four respect seed, temperature, and format, so once a prompt passes validation you can cache the output exactly as you planned.


4 · When determinism can still slip

  • Different quantisation of the same model tag.

  • Upgrading the Ollama engine: kernel tweaks can change rounding behaviour. (GitHub, Stack Overflow)

  • Running multi-threaded or on different GPUs without deterministic math flags.

If the output truly has to be invariant across machines, ship the model file hash plus the cached JSON alongside your application and fall back to the cache by default.
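
A sketch of that fall-back-to-cache-by-default pattern; the directory layout, manifest format, and file names are all assumptions, and get_answer() is the Ollama function from section 2.

python
import hashlib, json, pathlib

SHIPPED = pathlib.Path("shipped_artifacts")   # hypothetical layout: cached JSON + manifest ship with the app
MANIFEST = json.loads((SHIPPED / "manifest.json").read_text())   # e.g. {"model_sha256": "..."}

def local_model_matches(model_path: str) -> bool:
    """Compare the installed model file against the hash recorded at build time."""
    h = hashlib.sha256()
    with open(model_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == MANIFEST["model_sha256"]

def answer(topic: str) -> dict:
    cached = SHIPPED / f"{topic}.json"        # illustrative key; a prompt hash is more robust
    if cached.exists():                       # default: serve the shipped, known-good answer
        return json.loads(cached.read_text())
    if not local_model_matches("models/llama3.2.gguf"):
        raise RuntimeError("Model drifted and no shipped answer exists for this input.")
    return get_answer(topic)                  # live call, from the Ollama example in section 2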


Bottom line

Yes: just point the script at any Ollama model, set format:"json", fix the seed, and keep a schema-validation retry loop. That gives you identical structure every call while still letting each model fill in user-specific details.


From <https://chatgpt.com/>  ChatGPT o3
