Dealing with Hallucinations

How can we rely on unreliable LLMs?

Large language models can feel magical—until they start confidently making things up. In a recent evaluation of 11 state-of-the-art LLMs on a real-world customer-support intent task, we discovered a simple way to tame those “hallucinations” with only a modest hit to automation.

We tested 11 LLMs on Kaggle’s Customer Support Intent Dataset, asking them to classify user messages (billing, refunds, order changes, etc.).

The headline findings:

Graph of classification accuracy. Only 20% of the customer intents are classified with under 80% accuracy

Different LLMs tend to trip over different edge cases. One might misinterpret “cancel my order”; another might choke on a subtle refund clause. That diversity of mistakes is actually our superpower.

Cross-check with other models

Imagine running each query through two independent LLMs and flagging any time they disagree. The correlation between the models' predictions is fairly low, as shown below:

Correlation between LLMs. Values are typically between 10% and 30%
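Here is a minimal sketch of that cross-check, assuming an OpenAI-compatible chat API; the model names, intent list, prompt, and `route` helper are illustrative placeholders, not the exact setup from our evaluation.

```python
# Minimal sketch: classify one message with two independent models and flag
# any disagreement for human review. Assumes an OpenAI-compatible API; the
# model names, intent list, and prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes credentials are configured in the environment

INTENTS = ["billing", "refund", "order_change", "cancellation", "other"]
MODELS = ["gpt-4.1-mini", "some-other-provider-model"]  # two sufficiently different models

def classify(model: str, message: str) -> str:
    """Ask a single model to pick exactly one intent label."""
    prompt = (
        "Classify the customer message into exactly one of these intents: "
        + ", ".join(INTENTS)
        + ".\nReply with the intent name only.\n\nMessage: " + message
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

def route(message: str) -> dict:
    """Auto-accept when both models agree; otherwise flag for a human."""
    labels = [classify(m, message) for m in MODELS]
    if len(set(labels)) == 1:
        return {"intent": labels[0], "needs_review": False}
    return {"candidates": labels, "needs_review": True}
```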

So if we ask two or more models the same question, we can pass any disagreements to a human to cross-check. This adds review effort but improves quality. By how much?

| Models | Review effort | Error rate |
|--------|---------------|------------|
| 1      | 0%            | 14.1%      |
| 2      | 12.6%         | 3.7%       |
| 3      | 18.5%         | 2.2%       |
| 4      | 23.0%         | 1.5%       |
| 5      | 28.1%         | 0.7%       |

As the number of models increases, review effort grows roughly linearly while the error rate tapers toward zero.

That is, in this case, adding a second model cuts the error rate from 14.1% to 3.7% at the cost of manually reviewing about 13% of the traffic.
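Numbers like these are straightforward to reproduce on your own labeled sample once you have each model's per-example predictions. Here is a sketch, assuming predictions are already collected and that human reviewers correct every flagged case; `evaluate_ensemble` is a hypothetical helper, not a library function.

```python
# Sketch: estimate review effort and residual error for an ensemble, assuming
# per-model predictions are pre-computed and that any flagged (disagreeing)
# case is corrected by a human reviewer.
def evaluate_ensemble(predictions: dict[str, list[str]], gold: list[str]) -> tuple[float, float]:
    n = len(gold)
    flagged = 0      # examples routed to human review (any disagreement)
    auto_errors = 0  # examples auto-accepted with a wrong label
    for i in range(n):
        labels = {preds[i] for preds in predictions.values()}
        if len(labels) > 1:
            flagged += 1            # humans handle this one
        elif labels != {gold[i]}:
            auto_errors += 1        # unanimous but wrong: slips through
    return flagged / n, auto_errors / n

# Example: effort, error = evaluate_ensemble({"model_a": preds_a, "model_b": preds_b}, gold)
```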

Why does this work? Because the mistakes are mostly independent. LLMs trained on different data with different architectures err in largely uncorrelated ways, so the probability that two models independently pick the same wrong label out of many possible intents is low.
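A back-of-envelope check makes this concrete. Assuming a ~14% per-model error rate, roughly 27 intent classes (an assumption about the dataset), and wrong guesses spread uniformly and independently:

```python
# Back-of-envelope: chance that two fully independent models agree on the SAME
# wrong label. The class count and uniform-error assumption are illustrative.
p_err = 0.141                      # single-model error rate from the table above
k = 27                             # assumed number of intent classes
same_wrong = p_err**2 / (k - 1)    # both err AND land on the same wrong label
print(f"{same_wrong:.2%}")         # ~0.08% under full independence
```

The observed two-model residual error (3.7%) is higher than this idealized figure because model errors are only partially independent, as the 10–30% correlations above suggest, but the gain over a single model is still large.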

WARNING: Drop models that are consistently bad. In this case, we dropped google/gemma-3-27b-it.

So we can introduce a human in the loop only when needed. Reviewers see only the ~20–25% of cases where the models disagree, saving time while catching nearly every error.

The choice of model combination also matters. For example, pairing amazon/nova-lite-v1 with openai/gpt-4.1-mini adds only ~10% review effort for a ~2% error rate. The best combination depends on your use case.
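Given per-model predictions on a labeled sample, finding a good pairing is a small search. A sketch, reusing the hypothetical `evaluate_ensemble` helper from the earlier snippet:

```python
# Sketch: rank all model pairs by review effort, keeping only pairs whose
# residual error stays under a target. Reuses evaluate_ensemble from above.
from itertools import combinations

def best_pairs(predictions: dict[str, list[str]], gold: list[str], max_error: float = 0.02):
    results = []
    for a, b in combinations(predictions, 2):
        effort, error = evaluate_ensemble({a: predictions[a], b: predictions[b]}, gold)
        if error <= max_error:
            results.append((effort, error, a, b))
    return sorted(results)  # cheapest acceptable pairs first
```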

Takeaway

If you’re battling hallucinations in your LLM pipeline, you don’t need a perfect single model. You just need a cross-check: for a ~25% review load, multi-model agreement checks boost accuracy from ~85% to ~99%.

In short, rely on multiple (unreliable) LLMs rather than a single imperfect one. By turning your model fleet into a built-in safety net, you can automate at speed without sacrificing quality, finally making hallucination a thing of the past.

Oh, and it also saves about 25% of the jobs, though you’ll need to retrain those people as reviewers.