
Why Your LLM Gives Different Answers Every Time (And That's Okay)

You run the same prompt twice. You get two different answers. You think: "Is this thing just making it up as it goes?"

I had this exact thought roughly 18 months ago (it feels like 15 years ago). I was debugging a RAG system that would give brilliant answers on Monday and nonsense on Tuesday.

Same question. Same model.

Different universe, apparently.

Here's what's actually happening: Large Language Models aren't random. Yep, despite what you have heard, they're deterministic engines running through a precise sequence of transformations. But at each stage of that sequence, you've got dials. And most people are turning those dials without knowing which part of the machine they're actually affecting.

This isn't your fault.

The abstraction layer is genuinely misleading. API docs tell you "set temperature to 0.7" without explaining what temperature is or where it lives in the pipeline. It's like telling someone to "adjust the carburettor" without explaining what a carburettor does or where to find it.

But once you see where each parameter lives in the pipeline, LLM behaviour stops being mysterious and starts being controllable. Let me show you the sequence.

The Pipeline Itself

Every time you send a query to an LLM, it goes through six distinct stages:

Input Processing → Logits → Softmax → Probabilities → Token Selection → Final Output

Think of it like planning a route with a sat-nav or Google Maps:

  • You input your destination (the prompt).

  • The sat-nav computes costs for every possible turn (logits).

  • It converts those costs into a preference map (softmax).

  • It prunes implausible routes (probabilities).

  • It picks the next turn (token selection).

  • It stitches turns together until you reach the destination (final output).

The crucial thing to understand is that the model itself, the neural network with all its weights, is completely deterministic. If you feed it exactly the same input, it will always produce exactly the same raw scores (logits).

Always. Every single time.

What are logits? These are raw numerical scores for every possible word (token) in its vocabulary that could come next. Think of it as the model's internal "confidence scores" for tens of thousands of potential next words, ranging from negative to positive numbers.
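
To make the sequence concrete, here's the whole pipeline in miniature as a Python sketch. Everything in it is illustrative: the tiny vocabulary and the toy `model_logits` function just stand in for the deterministic engine, and the pruning stage (top-k/top-p) is skipped until we get to it below.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "<end>"]

def model_logits(prompt_tokens):
    # Stand-in for the deterministic engine: same input -> same raw scores.
    # A real model computes these from its frozen weights; here we derive toy
    # scores from the prompt's characters so they never change between runs.
    key = sum(ord(c) for c in " ".join(prompt_tokens))
    return [((key * (i + 7)) % 13) - 6.0 for i in range(len(VOCAB))]

def softmax(logits, temperature=1.0):
    # Softmax stage: divide by temperature, then normalise into probabilities.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_tokens, max_tokens=8, temperature=0.7, seed=42):
    rng = random.Random(seed)                 # seeded sampler for token selection
    tokens = list(prompt_tokens)              # input processing
    for _ in range(max_tokens):
        logits = model_logits(tokens)         # logits (deterministic)
        probs = softmax(logits, temperature)  # softmax -> probabilities
        # (pruning with top-k / top-p would happen here; omitted for now)
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]  # token selection
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return " ".join(tokens)                   # final output

print(generate(["the", "cat"]))
```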

BUT where does the variation come from?

It comes from three places. And if you understand these three sources, you understand everything about why LLMs behave the way they do. Well, almost.

The Big Three: Engine, Decoder, Environment

Here's the mental model that changed how I think about LLMs.

The Engine is the model itself. It's deterministic. Same input → same logits. This is the part that learned language by reading the internet. It's frozen once trained. It doesn't change. But we can't put every logit into an output; we have to pick between them.

Enter the decoder.

The Decoder is your policy for turning those logits into actual choices. Do you always pick the highest-scoring option? Do you sample from the distribution? Do you explore multiple paths? This is where parameters like temperature and top-p live. This is where variation enters the system by design.

The Environment is everything external to the model that can change between runs. Your RAG retrieval returns different documents/chunks. Your tool calls hit different API states. The timestamp changes. The world moves underneath you.

Pin the engine and environment, and you get reproducibility. Tune the decoder, and you control the creativity-consistency tradeoff. Let the environment drift, and you get variation even with identical prompts.

It's frustrating to watch people wrestle with this, but once you understand it, you see that this variation isn't a bug or a shortcoming of LLMs. It's a feature, and one you can control.

That's it. That's the whole picture.

Now let me show you where each dial actually lives.

Where the Dials Are

Most confusion about LLM parameters comes from not knowing which stage of the pipeline they affect. People treat "temperature" and "top-p" as interchangeable "randomness knobs" when they're actually doing completely different things at different stages.

Here's the real mapping.

At the Input Stage: The World Changes

Environment drift is the first source of variation, and it happens before the model even runs. Your retrieval system pulls different documents from the vector database. Your tool calls return different data. The timestamp is different. The same is true for popular chatbots: think Claude Projects, or a Custom GPT you've built. The files you upload there are retrieved through the same mechanism.

This is why you can't reproduce a RAG system's output just by saving the prompt. You need to save the entire context: which documents were retrieved, what the tools returned, what time it was. Without this, you're trying to replay a journey without knowing what the traffic was. TapeAgents, if you know, you know. Look it up.

This is where the manifest comes in—the governance record that nobody thinks about until they're trying to debug an incident three weeks later.

Model version, prompt template, retrieval snapshot IDs, tool versions, timestamps. Freeze this, and you can replay exactly what happened. Skip it, and you're guessing.
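
As a sketch, a manifest can be as simple as a dataclass you serialise next to every output. The field names below are illustrative, not a standard schema, and the values are made up.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class RunManifest:
    # Everything needed to replay one generation exactly.
    model_version: str
    prompt_template: str
    rendered_prompt: str
    retrieved_chunk_ids: list
    tool_versions: dict
    decoding: dict            # temperature, top_p, seed, etc.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

manifest = RunManifest(
    model_version="gpt-x-2024-06-01",        # pin the exact model snapshot
    prompt_template="support_answer_v3",
    rendered_prompt="Answer using the context below...",
    retrieved_chunk_ids=["doc_12#4", "doc_98#1"],
    tool_versions={"search_api": "2.3.1"},
    decoding={"temperature": 0.2, "top_p": 0.9, "seed": 1234},
)

# Log it next to the output; replaying means re-running with these exact values.
print(json.dumps(asdict(manifest), indent=2))
```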

At the Logits Stage: Shaping Raw Scores

The model has produced its raw scores. Before we turn them into probabilities, we can transform them.

Repetition penalties and presence penalties adjust the logits to discourage the model from repeating itself. "You've already said 'however' three times. Let's lower the score for saying it again."

Frequency penalties do the same thing, but scale with how many times a token has already appeared in the text so far.

These penalties are applied directly to the logits before anything else happens, typically as simple additive adjustments. They're not about choosing tokens—they're about adjusting the scores before we even consider choice.
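
Here's a minimal sketch of that idea, with made-up logits and penalty values. Real APIs differ in the details (some repetition penalties are multiplicative rather than additive).

```python
from collections import Counter

def apply_penalties(logits, generated_tokens, presence_penalty=0.6, frequency_penalty=0.3):
    # Additive penalties on raw logits, before softmax:
    # presence subtracts a flat amount once a token has appeared at all,
    # frequency subtracts more the more often it has appeared.
    counts = Counter(generated_tokens)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token in adjusted:
            adjusted[token] -= presence_penalty           # appeared at all
            adjusted[token] -= frequency_penalty * count  # appeared this often
    return adjusted

# "however" has already been used three times, so its score drops the most.
logits = {"however": 2.1, "therefore": 1.8, "and": 1.5}
print(apply_penalties(logits, ["however", "however", "however", "and"]))
```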

At the Softmax Stage: The Adventurousness Dial

Now we're turning those raw logits into probabilities. This is where temperature lives.

Here's what temperature actually does: it divides every logit by T before running the softmax. That's it. That's the whole mechanism.

Softmax is a mathematical function that converts raw numerical scores into probabilities that sum to exactly 1, making them interpretable as a probability distribution. It's essentially a normalisation function that takes any set of real numbers (including negatives) and transforms them into values between 0 and 1 that represent the likelihood of each possible outcome.

Low temperature (T = 0.2): the highest-scoring option becomes even more dominant. The distribution gets sharper. You're being conservative.

High temperature (T = 1.5): the alternatives get more weight. The distribution flattens. You're being adventurous.

Temperature = 1.0: no change. This is the model's "natural" distribution.

And "temperature = 0"? That's not actually a temperature. It's a switch that says "skip all this probability stuff and just pick the highest-scoring option every time."

It's what we call greedy decoding in model speak (orc tongue).

Temperature isn't "randomness." It's how seriously you take the alternatives. Big difference.
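
To see the mechanism, here's a sketch with three made-up logits run through a temperature-scaled softmax:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide every logit by T, then apply a standard (numerically stable) softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]  # made-up raw scores for three candidate tokens

for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# T=0.2 -> the top score dominates almost completely (sharper distribution)
# T=1.0 -> the model's "natural" distribution
# T=1.5 -> the alternatives get noticeably more weight (flatter distribution)
```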

At the Probabilities Stage: Pruning the Options

You've got a probability distribution. Maybe it's sharp, maybe it's flat. Either way, you might want to prune it before making a choice.

Top-k says: "Only consider the top k highest-probability tokens."

For example, if you set top-k = 50, the model can only choose from the 50 most likely next words, completely ignoring all other possibilities. This uses a fixed number regardless of how confident the model is. It's particularly useful when you want more predictable outputs, as it prevents the model from selecting rare or irrelevant tokens.

Top-p (nucleus sampling) says: "Keep adding tokens in descending order until you've covered p% of the probability mass."

For instance, with top-p = 0.9, if the top token has 80% probability and the next has 15%, only those two would be considered, since together they cover more than 90% of the probability mass. Setting top-p closer to 1.0 (like 0.95) makes the LLM more creative, while lower values (like 0.5) make it safer and more focused.

Both are doing the same conceptual thing: removing implausible options before you sample. They're not choosing the token—they're narrowing the field.

This is why you can use them together. Top-k with top-p. They're both just pruning strategies. One counts tokens, one counts probability mass.
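
Here's a sketch of both pruning strategies over a made-up distribution of four "exits":

```python
def top_k_filter(probs, k):
    # Keep only the k highest-probability tokens, then renormalise.
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

def top_p_filter(probs, p):
    # Keep tokens in descending order until cumulative probability >= p,
    # then renormalise. Also called nucleus sampling.
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {t: pr / total for t, pr in kept.items()}

probs = {"exit_1": 0.80, "exit_2": 0.15, "exit_3": 0.04, "exit_4": 0.01}

print(top_k_filter(probs, k=2))    # only the two most likely exits survive
print(top_p_filter(probs, p=0.9))  # 0.80 + 0.15 = 0.95 >= 0.9, so the same two survive
# The two can be chained: apply top-k first, then top-p to what's left.
```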

At the Token Selection Stage: Making the Choice

Now you actually have to pick a token. This is the moment where probability becomes reality.

Greedy decoding: pick the highest-probability token.

Deterministic. Boring. Consistent.

It's like always taking the exit with the best sat-nav score. Every single time. No exceptions. If the sat-nav says "first exit, 23 minutes to destination," you take the first exit. You'll get there, but you'll always take the same route.

Sampling: draw from the (possibly pruned) distribution.

It's like letting a weighted coin decide. Exits with better scores get more faces on the coin, but even the third-best exit gets a chance.

The seed you set here determines which random number generator state you're using. It's like using the same coin with the same starting position: flip it the same way, and you'll get the same sequence of heads and tails. Set the seed, and your "random" sampling becomes reproducible. Same seed + same distribution = same sample. You can replay the exact same "random" journey.
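
A minimal sketch contrasting the two, over a made-up (already pruned) distribution:

```python
import random

probs = {"exit_1": 0.7, "exit_2": 0.2, "exit_3": 0.1}  # made-up pruned distribution

def greedy(probs):
    # Always take the highest-probability token. No randomness at all.
    return max(probs, key=probs.get)

def sample(probs, seed):
    # Draw from the distribution with a seeded generator:
    # same seed + same distribution = same "random" choice, every run.
    rng = random.Random(seed)
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

print(greedy(probs))             # always "exit_1"
print(sample(probs, seed=1234))  # reproducible, but exit_2 or exit_3 can win
print(sample(probs, seed=1234))  # identical to the line above
```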

Beam search: maintain multiple candidate sequences simultaneously. Explore several paths through the probability space. At each step, keep only the best few partial sequences (the beam width) by score. At the end, pick the highest-scoring complete sequence.

Beam search isn't sampling from the model's distribution—it's doing a search using those probabilities to guide exploration. Different category of technique entirely.
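
Here's a toy sketch of the idea; the `toy_next_log_probs` function is an assumption standing in for a real model's next-token scores.

```python
import math

def beam_search(next_log_probs, start, beam_width=2, steps=3):
    # `next_log_probs(seq)` must return {token: log_prob} for the next token.
    # At each step we extend every candidate sequence, then keep only the
    # `beam_width` best partial sequences by total log-probability.
    beams = [(start, 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]  # highest-scoring complete sequence

# A made-up scoring function so the example runs.
def toy_next_log_probs(seq):
    options = {"left": 0.5, "right": 0.3, "straight": 0.2}
    return {t: math.log(p) for t, p in options.items()}

best_seq, best_score = beam_search(toy_next_log_probs, start=["start"])
print(best_seq, round(best_score, 3))
```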

At the Output Stage: Choosing Among Candidates

You've generated a complete sequence. Or maybe you've generated several. Now what?

This is where rerankers live. Generate multiple candidates (via sampling or beams), then score them with something more sophisticated than token probabilities. Maybe you check groundedness against source documents. Maybe you run them through a reward model. Maybe you literally use another LLM as a judge.

This is the "explore, then commit" pattern, and it's increasingly how production systems work. You let the model explore the possibility space through sampling, then you deterministically choose the best option according to what you actually care about.

Governance gates also live here: safety filters, policy checks, audit logging. You're not shaping the generation—you're deciding whether to release it.

What This Means Tomorrow Morning

Here's how this changes the way you work.

When outputs vary batch to batch, check environment drift first.

Are your RAG results stable?

Are your tool calls returning consistent data?

Log the manifest. Then check the seed if you're sampling.

When the model is too creative (making things up, going off-topic), lower the temperature before touching anything else.

  • You're telling it to stick closer to the most likely path.

  • If that's not enough, then reduce top-p to prune more aggressively.

When the model is too boring (repetitive, predictable, generic), raise temperature and top-p together.

  • You're both flattening the distribution and keeping more alternatives in play.

  • Consider adding presence penalties to discourage repetition.

When you need reproducibility, pin everything:

  • freeze the environment (save the manifest),

  • set the seed if sampling, or

  • switch to greedy decoding. The only randomness should be deliberate.
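
As a concrete illustration of that checklist, a pinned request might look something like this. The parameter names mirror common chat-completion APIs, but this is a sketch, not a specific SDK call.

```python
# A pinned, replayable run: every source of variation is either frozen or logged.
pinned_request = {
    "model": "gpt-x-2024-06-01",   # exact model snapshot, not a floating alias
    "temperature": 0.0,            # greedy decoding: skip sampling entirely
    "seed": 1234,                  # harmless here, essential if you sample later
    "messages": [
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": "Context: <frozen retrieval snapshot>\n\nQuestion: ..."},
    ],
}
# Send this with whichever client you use, and store it in the manifest
# alongside the retrieval snapshot IDs so the run can be replayed later.
```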

When you need quality, don't just sample once at high temperature.

  • Use sampling to generate multiple candidates, then rerank them.

  • Exploration plus deterministic commitment.

When you're debugging, know which stage you're actually adjusting. "The output is too random" might mean the temperature is too high (decoder problem) or the retrieval is unstable (environment problem) or you forgot to set the seed (selection problem). Different stages, different solutions.

The Thing Nobody Tells You

LLMs aren't mysterious. They're pipelines with dials at specific stages.

The model itself is deterministic. Temperature isn't "randomness"—it's how seriously to consider alternatives. Top-p isn't "creativity"—it's pruning the candidate set before sampling. Seed isn't magic—it's making your PRNG replayable. And those different answers you're getting? They're coming from the decoder or the environment, not from the model being "random."

Once you see where each dial lives, you stop guessing and start engineering. You know that tweaking temperature won't help if the problem is unstable retrieval. You know that setting the seed to zero isn't the same as turning randomness off: a seed makes sampling replayable, it doesn't remove it. You know that turning all the knobs to maximum "creativity" is just making the model consider implausible options more seriously.

This is the difference between using LLMs and understanding them.

You now know where the dials are. Use them precisely.

What's one parameter you've been using wrong? I'd genuinely love to hear the patterns you've discovered—especially the ones you found by breaking things first.
