Your AI Isn't Hallucinating (Most of the Time)

And calling everything a "hallucination" is exactly why you're fixing it wrong, and spending 5x more than you should.

Here's what's actually happening.

When your AI gives you a wrong answer, you're probably seeing one of four completely different failure modes. But everyone uses the same word—"hallucination"—for all of them. And that imprecision costs you millions.

Let me show you what I mean.

Imagine you're training a junior analyst to summarise financial reports.

Four different things can go wrong:

Mistake Type 0: You gave them last month's data file instead of this month's. They produce a perfectly correct analysis, of the wrong data. Nothing's broken; the environment changed.

Mistake Type 1: Every time they see "EBITDA growth," they report it as positive news, even when the growth is negative. They've learnt a rule, but the rule is wrong. They're not making things up; they're applying a consistent but incorrect pattern.

Mistake Type 2: They've actually got the knowledge, but you told them "when you're uncertain, just pick randomly from your top 3 guesses." They haven't learnt to generate multiple analyses and pick the most grounded one. They're using a bad decision strategy.

Mistake Type 3: They don't have the Q3 numbers at all, but they generate a plausible-sounding "Q3 revenue increased 15% year-over-year" because that's what Q3 reports usually say. They're genuinely fabricating to fill a knowledge gap.

Only Mistake Type 3 is what we should call "hallucination": genuine fabrication from uncertainty.

But here's the problem: your AI systems make all four types of mistakes, and only Type 3 actually needs the expensive fixes you're building.

Type 0 mistakes (environment drift) happen when your retrieval index updated, your tools changed, or your data snapshot moved. Same model, same settings, different context, different answer.

The fix: Freeze your retrieval snapshots. Pin your tool versions. Log exactly what context the model saw. Make incidents reproducible.

The cost: Infrastructure work, maybe £30-80K depending on your stack.

What teams actually do: Assume the model is "unpredictable" and add human verification everywhere.
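
In practice, "pin and log everything" can be as small as a record attached to every model call. Here's a rough Python sketch; the field names (retrieval_snapshot_id, tool_versions and so on) are placeholders for whatever your stack actually exposes, not a real library's schema.

```python
import hashlib
import json
import time
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunContext:
    model_version: str           # exact model or checkpoint identifier
    prompt_template_sha: str     # hash of the template you actually rendered
    temperature: float
    top_p: float
    seed: int
    retrieval_snapshot_id: str   # the frozen index the retriever served from
    tool_versions: dict          # e.g. {"calculator": "1.4.2", "sql_tool": "0.9.0"}

def log_run(ctx: RunContext, user_input: str, output: str, path: str = "runs.jsonl") -> None:
    """Append one replayable record per model call."""
    record = {
        "ts": time.time(),
        "context": asdict(ctx),
        "input_sha": hashlib.sha256(user_input.encode()).hexdigest(),
        "output": output,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

If an incident can't be replayed from a record like this, you're debugging folklore, not a failure.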

Type 1 mistakes (wrong patterns learnt during training) are deterministic. Run the same prompt 10 times with temperature zero and a fixed seed, and you'll get the same wrong answer every time. The model isn't being creative; it genuinely believes the wrong answer.

The fix: Better prompts with specific examples. Step-by-step verification prompts that force the model to show its work. Programmatic checks, regex validators, calculator calls, unit tests for outputs. Sometimes fine-tuning on corrected examples.

The cost: Typical range £50-200K in prompt engineering and validation infrastructure.

What teams actually build: £2M+ human verification systems because they think it's "unpredictable hallucination."
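
A programmatic check doesn't have to be clever to pay for itself. Here's a hedged sketch using the EBITDA example from earlier; the regex patterns and the function itself are illustrative, not a finished validator.

```python
import re

def validate_ebitda_summary(summary: str, actual_growth_pct: float) -> list[str]:
    """Catch the 'negative growth reported as good news' pattern before a human sees it."""
    problems = []

    # Sign check: positive framing over a negative number.
    positive_tone = re.search(r"\b(increase\w*|improv\w*|strong|encouraging|up)\b", summary, re.I)
    negative_tone = re.search(r"\b(decline\w*|decrease\w*|fell|down|negative)\b", summary, re.I)
    if actual_growth_pct < 0 and positive_tone and not negative_tone:
        problems.append(f"Reads positive, but EBITDA growth is {actual_growth_pct}%")

    # Figure check: every percentage the model cites should exist in the source data.
    cited = {float(m) for m in re.findall(r"(-?\d+(?:\.\d+)?)\s*%", summary)}
    if cited and actual_growth_pct not in cited:
        problems.append(f"Cited {sorted(cited)} but the actual figure is {actual_growth_pct}%")

    return problems

# A deterministic wrong answer fails the same way every time, so a cheap check catches it every time.
print(validate_ebitda_summary("EBITDA growth of 12% is encouraging", actual_growth_pct=-3.5))
```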

Type 2 mistakes (poor sampling strategy) happen when the model actually has the right answer in its probability distribution, but your decoding policy fails to extract it. Ask the AI directly in simple terms and it gets it right. Ask in a complex context and it gets it wrong.

The fix: This is where decoding strategy matters. Lower your temperature and top-p to sharpen the distribution. Generate 10 candidates and rerank them by groundedness. Use self-consistency: generate multiple reasoning paths and take the majority answer. Add explicit uncertainty checks where the model can say "I'm not confident about this."

The cost: Ballpark £30-100K in sampling infrastructure improvements.

What teams actually build: Comprehensive retrieval architectures for £800K+, even though the model already knows the answer.
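
Self-consistency is simpler than it sounds. Below is a minimal sketch: `call_model` is a stand-in for however you invoke your LLM (assumed to take a prompt plus sampling parameters and return a string), and majority-voting over normalised strings is a deliberate simplification of the full technique, which votes over extracted final answers.

```python
from collections import Counter
from typing import Callable

def self_consistent_answer(
    call_model: Callable[..., str],   # your own client wrapper, not a real library call
    prompt: str,
    n_samples: int = 10,
    temperature: float = 0.7,
) -> tuple[str, float]:
    """Sample several answers and return the most common one plus its agreement rate."""
    answers = [
        call_model(prompt, temperature=temperature, seed=i).strip().lower()
        for i in range(n_samples)
    ]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n_samples

# Usage: high agreement means the model already knew; decoding was the bottleneck.
# answer, agreement = self_consistent_answer(my_llm, "What was Q2 EBITDA growth?")
```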

Type 3 mistakes (genuine knowledge gaps) are the only true hallucinations. The model has flat uncertainty: sample 20 times with different seeds and you get wildly different answers. It doesn't know, but instead of saying "I don't know," it fabricates something plausible.

The fix: Grounding architectures. Forced abstention when uncertainty exceeds a threshold. Retrieval systems to inject verified facts. Human-in-the-loop for high-stakes decisions.

The cost: £500K-2M in architectural changes, depending on scale.

This is the only case that actually needs the expensive infrastructure.
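
Forced abstention is the cheapest piece of that architecture, and it looks roughly like this. `call_model` and `retrieve` are stand-ins for your own LLM client and retrieval layer, and the 60% agreement threshold is an assumption you'd tune against your own data.

```python
from collections import Counter
from typing import Callable

ABSTAIN = "I don't have enough grounded information to answer that."

def grounded_answer_or_abstain(
    call_model: Callable[..., str],      # your LLM client wrapper
    retrieve: Callable[[str], str],      # your retrieval layer, returns verified context
    question: str,
    n_samples: int = 20,
    min_agreement: float = 0.6,
) -> str:
    context = retrieve(question)
    prompt = (
        "Answer using only the context below. If the context doesn't contain the answer, "
        f"say so.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    samples = [call_model(prompt, temperature=1.0, seed=i).strip().lower() for i in range(n_samples)]
    best, count = Counter(samples).most_common(1)[0]
    if count / n_samples < min_agreement:
        return ABSTAIN   # flat distribution: a knowledge gap, so refuse rather than fabricate
    return best
```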

In my experience across enterprise deployments, the majority of your "hallucinations" are Type 0, 1, and 2. They're not hallucinations at all. They're environment issues, learnt-pattern errors, and sampling errors—problems with specific, testable solutions.

But because everyone calls everything "hallucination," you're building Type 3 solutions (expensive, architectural, human-in-the-loop) for Type 0, 1, and 2 problems (cheaper, environmental controls, prompt engineering, sampling improvements).

The language makes you think the problem is unpredictable and unfixable. That creates learned helplessness. "AI just does this sometimes." So you limit use cases, over-invest in oversight, and under-deploy systems that could actually work with proper diagnosis.

The diagnostic is simple, and you can run it this afternoon (a code sketch of the whole flow follows the steps):

Take your most common AI failure.

Step 1: Pin everything. Model version, prompt template, decoding parameters. Freeze your retrieval snapshot. Set temperature to zero and fix your random seed.

Step 2: Run the exact same input 10 times.

  • Same wrong answer every time? Type 1. The model learnt wrong. You need better prompting and validation checks.

Step 3: Ask the atomic question with no context. Just the bare question.

  • Works when simplified but fails in your full context? Type 2. The model knows but your sampling strategy isn't extracting it. You need better decoding—lower temperature, generate multiple candidates and rerank, add self-consistency checks.

Step 4: Unpin the temperature. Sample 20 times with different seeds.

  • Wildly different answers with low agreement? Type 3. True hallucination. You need grounding architecture and forced abstention.

Step 5: Unfreeze your retrieval and repeat the test.

  • Output changes only when retrieval changes? Type 0. Environment drift. You need to snapshot and version your context.
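
If you want that checklist as code, here's a rough sketch. `call_model` is again a stand-in for your own client, `expected` is the answer you already know to be right, and the agreement thresholds are judgment calls rather than magic numbers. Step 5 needs a second run with retrieval unfrozen, so it stays a manual check.

```python
from collections import Counter
from typing import Callable

def diagnose_failure(
    call_model: Callable[..., str],
    full_prompt: str,        # the failing prompt, with all its real context
    atomic_question: str,    # the bare question with no surrounding context
    expected: str,           # the answer you know to be correct
) -> str:
    norm = lambda s: s.strip().lower()

    # Steps 1-2: pin everything and repeat the exact same input.
    pinned = [norm(call_model(full_prompt, temperature=0.0, seed=42)) for _ in range(10)]
    if len(set(pinned)) == 1 and pinned[0] != norm(expected):
        return "Type 1: deterministic wrong answer. Fix prompts, add validators."

    # Step 3: ask the atomic question on its own.
    if norm(call_model(atomic_question, temperature=0.0, seed=42)) == norm(expected):
        return "Type 2: the model knows; your decoding isn't extracting it."

    # Step 4: unpin the temperature and measure agreement.
    sampled = [norm(call_model(full_prompt, temperature=1.0, seed=i)) for i in range(20)]
    _, top = Counter(sampled).most_common(1)[0]
    if top / len(sampled) < 0.5:
        return "Type 3: flat uncertainty. Grounding and forced abstention."

    return "Inconclusive: re-run with retrieval unfrozen to check for Type 0 drift."
```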

Let's back-propagate through that diagnostic into your budget line items.

The precision in your diagnosis determines the precision in your spend.

What this means for your metrics:

Stop measuring just "accuracy." Start tracking:

  • Abstention rate (how often does the model admit uncertainty?)

  • Groundedness score (how much does output match your verified context?)

  • Replay rate (can you reproduce the failure?)

  • Cost per answer (are you paying Type 3 prices for Type 1 problems?)

These are the metrics that unlock sensible budget allocation.
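
Two of those fall straight out of the run log sketched earlier; here's a back-of-envelope version. The log format, the abstain marker, and counting fully pinned runs as "replayable" are all assumptions to adapt to your own pipeline.

```python
import json

def run_metrics(path: str = "runs.jsonl", abstain_marker: str = "don't have enough") -> dict:
    with open(path) as f:
        runs = [json.loads(line) for line in f if line.strip()]
    if not runs:
        return {}
    abstained = sum(1 for r in runs if abstain_marker.lower() in r["output"].lower())
    replayable = sum(
        1 for r in runs
        if r["context"].get("seed") is not None and r["context"].get("retrieval_snapshot_id")
    )
    return {
        "abstention_rate": abstained / len(runs),    # how often the model admits uncertainty
        "replayable_share": replayable / len(runs),  # failures you could actually reproduce
    }
```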

Next time your team says "hallucination," ask them: "Which type?"

The answer determines whether you're spending £100K or £2M.

Got a costly "hallucination" fix you're working on? Drop the failure pattern below and let's run the diagnostic together to see what type it actually is.
