A real-world experiment tested 3 AI systems on 315 scientific papers. Here's what we learned—and why it matters.
The naive approach, averaging the raw scores: (2 + 6 + 3) ÷ 3 ≈ 3.7. But the three AIs don't score on the same scale, so that average is meaningless. First normalize each AI to the same scale, THEN average.
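A minimal sketch of normalize-then-average, assuming for illustration that each AI scores on its own known min-max range (the actual ranges aren't stated here):

```python
def normalize(score: float, lo: float, hi: float) -> float:
    """Map a raw score onto a shared 0-1 scale using that reviewer's own range."""
    return (score - lo) / (hi - lo)

# Hypothetical scale ranges, for illustration only.
scales = {"AIRev1": (1, 5), "AIRev2": (1, 10), "AIRev3": (1, 5)}
raw = {"AIRev1": 2, "AIRev2": 6, "AIRev3": 3}

normalized = {name: normalize(score, *scales[name]) for name, score in raw.items()}
combined = sum(normalized.values()) / len(normalized)
# normalized ≈ {'AIRev1': 0.25, 'AIRev2': 0.56, 'AIRev3': 0.5}; combined ≈ 0.44
```

On the shared 0-1 scale the combined score is about 0.44, a number with a defined meaning, unlike 3.7 averaged across three incompatible scales.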
| Paper | AIRev1 Score | AIRev2 Score | Gap |
|---|---|---|---|
| #277 | 1 (confident) | 6 (confident) | 5 pts |
| #148 | 2 (confident) | 6 (confident) | 4 pts |
| #32 | 2 (confident) | 6 (confident) | 4 pts |
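One way to surface these high-disagreement papers automatically, sketched with the three papers from the table above; the 3-point flagging threshold is an assumption, not a value from the study:

```python
# Scores for the three papers shown in the table.
reviews = [
    {"paper": 277, "airev1": 1, "airev2": 6},
    {"paper": 148, "airev1": 2, "airev2": 6},
    {"paper": 32,  "airev1": 2, "airev2": 6},
]

GAP_THRESHOLD = 3  # assumed cut-off for "the reviewers genuinely disagree"

def gap(review: dict) -> int:
    """Absolute score gap between the two AI reviewers for one paper."""
    return abs(review["airev1"] - review["airev2"])

flagged = sorted((r for r in reviews if gap(r) >= GAP_THRESHOLD), key=gap, reverse=True)
for r in flagged:
    print(f"Paper #{r['paper']}: gap {gap(r)} pts -> route to human review")
```

In practice the gap should be computed on normalized scores, since the raw scales differ; the raw points above are only what the table reports.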
Sanity checks an AI can run on every paper before human review (a code sketch follows this list):
- Does the AI model the paper claims to use actually exist?
- Do the timeline claims make sense, with no data from the future?
- Do the referenced papers actually exist?
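A minimal sketch of how those three checks could be automated. The model registry, the year regex, and the reference index are illustrative stand-ins, not part of the original experiment:

```python
import re
from datetime import date

# Illustrative registry of model names; in practice this would be a maintained list.
KNOWN_MODELS = {"gpt-4", "gpt-3.5", "llama-2", "claude-2"}

def model_exists(claimed_model: str) -> bool:
    """Does the AI model the paper claims to use appear in the registry?"""
    return claimed_model.strip().lower() in KNOWN_MODELS

def no_future_data(text: str) -> bool:
    """Do all four-digit years mentioned in the text fall on or before this year?"""
    years = (int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", text))
    return all(y <= date.today().year for y in years)

def missing_references(cited_titles: list[str], reference_index: set[str]) -> list[str]:
    """Return cited titles that cannot be found in a local reference index."""
    return [t for t in cited_titles if t.lower() not in reference_index]

# Example: flag a paper that claims data collected in 2031.
print(no_future_data("We collected responses between March 2031 and May 2031."))  # False
```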
"Check all papers for impossible claims before human review"
"Let AI decide if the science is good"
| Pattern | Share of papers | Action |
|---|---|---|
| All AIs agree: Low scores | ~35% | Fast reject candidate |
| AIRev2 high + AIRev1 low | 25% | → Human review |
| All AIs agree: High scores | ~5% | Fast accept candidate |
| Mixed (other patterns) | ~35% | Standard process |
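A sketch of how that routing table could run in code, assuming scores already normalized to 0-1 and assumed cut-offs for "low" and "high"; the thresholds are illustrative, not from the study:

```python
LOW, HIGH = 0.3, 0.7  # assumed cut-offs on the normalized 0-1 scale

def route(scores: dict[str, float]) -> str:
    """Map one paper's normalized AI scores to a triage action."""
    values = scores.values()
    if all(v <= LOW for v in values):
        return "fast-reject candidate"
    if all(v >= HIGH for v in values):
        return "fast-accept candidate"
    if scores.get("AIRev2", 0.0) >= HIGH and scores.get("AIRev1", 1.0) <= LOW:
        return "human review"
    return "standard process"

print(route({"AIRev1": 0.25, "AIRev2": 0.85, "AIRev3": 0.50}))  # human review
```

Keeping the outputs as "candidates" rather than final decisions matches the table: the patterns decide how much human attention a paper gets, not whether it is accepted.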
The math is broken (averaging doesn't work), the signals are broken (confidence means nothing), but the opportunity is real (disagreement patterns are valuable).