Three evaluators are shown the same 30 filters, the same three test images, and nearly the same instruction: judge like an expert. They do not come back with the same world.
At first glance, this looks like a scoring exercise. It is not. It is a story about what each evaluator notices first.
GPT builds a broad, generous rubric and compresses the field. Claude writes like a curator and punishes cross-input failures. Gemini brings a measurement kit, finds color signatures and palette violations, and sometimes disagrees so sharply that the dispute becomes the finding.
This page reconstructs that trial from `prompts.md`, the three evaluation reports, and the consolidated `evaluation.json`.