LLM Filter Gallery / Evaluation Story
Narrative-Driven Data Story

Three Models Walk Into an Art Jury

They are shown the same 30 filters, the same three test images, and nearly the same instruction: judge like an expert. They do not come back with the same world.

At first glance, this looks like a scoring exercise. It is not. It is a story about what each evaluator notices first.

GPT builds a broad, generous rubric and compresses the field. Claude writes like a curator and punishes cross-input failures. Gemini brings a measurement kit, finds color signatures and palette violations, and sometimes disagrees so sharply that the dispute becomes the finding.

This page reconstructs that trial from `prompts.md`, the three evaluation reports, and the consolidated `evaluation.json`.

Loading `evaluation.json` and assembling the case file…