Large language models can feel magical—until they start confidently making things up. In a recent evaluation of 11 state-of-the-art LLMs on a real-world customer-support intent task, we discovered a simple way to tame those “hallucinations” with only a modest hit to automation.
We tested 11 LLMs on Kaggle’s Customer Support Intent Dataset, asking them to classify user messages (billing, refunds, order changes, etc.).
The headline finding:
Different LLMs tend to trip over different edge cases. One might misinterpret “cancel my order”; another might choke on a subtle refund clause. That diversity of mistakes is actually our superpower.
Imagine running each query through two independent LLMs and flagging any case where they disagree. In our evaluation, the models' errors were only weakly correlated with one another.
So if we ask two or more models the same question, we can send any disagreement to a human to cross-check. This adds effort but improves quality. By how much?
| Models | Human review effort | Final error rate |
|---|---|---|
| 1 | 0% | 14.1% |
| 2 | 12.6% | 3.7% |
| 3 | 18.5% | 2.2% |
| 4 | 23.0% | 1.5% |
| 5 | 28.1% | 0.7% |
That is, in this case, adding a second model cuts the error rate from 14.1% to 3.7% at the cost of routing 12.6% of queries to a human, and each extra model trades a bit more review effort for a lower residual error.
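For concreteness, here is a minimal sketch of how numbers like these can be computed from a labeled evaluation set. The unanimity rule matches the scheme described above; the variable names and usage are illustrative assumptions, not the actual evaluation code.

```python
def effort_and_error(predictions: list[list[str]], gold: list[str]) -> tuple[float, float]:
    """Estimate human-review effort and residual error for an ensemble.

    predictions[i] holds one label per model for query i; gold[i] is the true label.
    If every model agrees, the answer is auto-accepted; any disagreement is routed
    to a human reviewer, who we assume resolves it correctly.
    """
    routed = 0
    residual_errors = 0
    for preds, truth in zip(predictions, gold):
        if len(set(preds)) == 1:          # unanimous -> fully automated
            if preds[0] != truth:
                residual_errors += 1      # undetected joint mistake slips through
        else:                             # disagreement -> human review (effort)
            routed += 1
    n = len(gold)
    return routed / n, residual_errors / n


# Hypothetical usage with two models' predictions on a labeled evaluation split:
# effort, error = effort_and_error([list(p) for p in zip(preds_a, preds_b)], gold_labels)
# print(f"review effort: {effort:.1%}, residual error: {error:.1%}")
```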
Why does this work? Because the mistakes are mostly independent. LLMs built on different training data and architectures err in largely uncorrelated ways, so the chance that two models independently land on the same wrong label out of many possible intents is low.
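A back-of-the-envelope calculation makes the intuition concrete. The 14.1% figure comes from the table above; full independence, a uniform spread over wrong labels, and the 27-class count are simplifying assumptions for illustration.

```python
single_error = 0.141       # single-model error rate, from the table above
num_labels = 27            # assumed number of intent classes (illustrative)
wrong_choices = num_labels - 1

# Both models wrong AND landing on the same wrong label, under independence:
p_undetected = single_error ** 2 / wrong_choices
print(f"P(same wrong answer) ≈ {p_undetected:.3%}")  # ≈ 0.076%
```

The measured 3.7% residual error for two models is well above this idealized figure, so the errors are clearly only partially independent; even so, the diversity is enough to catch most mistakes.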
WARNING: Drop models that are consistently bad. In this case, we dropped `google/gemma-3-27b-it`.
So we can bring a human into the loop only when needed: reviewers see just the ~20–25% of cases where the models disagree, saving time while catching nearly every error.
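At serving time, the same idea is a small gate in front of the pipeline. A minimal sketch, assuming a `classify(model, message)` helper that wraps your LLM calls and whatever review queue you already use (both hypothetical):

```python
MODELS = ["amazon/nova-lite-v1", "openai/gpt-4.1-mini"]  # one pairing discussed below

def classify(model: str, message: str) -> str:
    """Placeholder: call your LLM provider here and return one intent label."""
    raise NotImplementedError

def route(message: str) -> dict:
    """Auto-resolve unanimous classifications; flag any disagreement for review."""
    labels = [classify(model, message) for model in MODELS]
    if len(set(labels)) == 1:
        return {"intent": labels[0], "needs_review": False}
    # The models disagree: hand the case to a human instead of guessing.
    return {"candidates": labels, "needs_review": True}
```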
The choice of model combination can matter. For example, pairing `amazon/nova-lite-v1` with `openai/gpt-4.1-mini` adds only ~10% effort for a ~2% error rate. The best combination will depend on your use case.
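One way to choose is to sweep candidate pairs over the labeled evaluation set and inspect the effort/error trade-off, reusing the `effort_and_error` helper sketched earlier; how the predictions are stored (`preds_by_model`) is again an assumption.

```python
from itertools import combinations

def pair_frontier(preds_by_model: dict[str, list[str]], gold: list[str]):
    """Rank all model pairs by residual error, then by review effort."""
    results = []
    for a, b in combinations(preds_by_model, 2):
        paired = [list(p) for p in zip(preds_by_model[a], preds_by_model[b])]
        effort, error = effort_and_error(paired, gold)
        results.append({"pair": (a, b), "effort": effort, "error": error})
    return sorted(results, key=lambda r: (r["error"], r["effort"]))
```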
If you’re battling hallucinations in your LLM pipeline, you don’t need a perfect single model; you just need a second opinion. For a ~25% review load, multi-model cross-checks boost accuracy from ~85% to ~99%.
In short, rely on multiple (individually unreliable) LLMs rather than a single imperfect one. By turning your model fleet into a built-in safety net, you can automate at speed without giving up quality, and make unchecked hallucinations a thing of the past.
Oh, and it also preserves about 25% of the human jobs, though you’ll need to train those people as reviewers.