In 2026, 765 students generated nearly two thousand AI images and had them all evaluated by Gemini. The results didn't match the marketing. The best model for one creative brief was the worst for another. The map of task fit had been hiding in plain sight.
Everyone Had a Favorite, and Most People Used It for Everything
DALL·E dominated the dataset with 935 evaluated images, nearly 54% of the total. Most students had a single preferred model and used it across all four briefs. This is natural behavior: pick a tool, learn it well, apply it everywhere. It's efficient. It's also wrong.
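That breakdown is a one-liner to reproduce. A minimal sketch, assuming a flat results table with one row per evaluated image; the file name and column names ("model", "brief", "score") are illustrative, not the study's actual schema:

```python
import pandas as pd

# Hypothetical export: one row per evaluated image.
df = pd.read_csv("image_evaluations.csv")

# Count images per model and their share of the dataset.
counts = df["model"].value_counts()
share = df["model"].value_counts(normalize=True)
print(pd.DataFrame({"images": counts, "share": share.round(3)}))
```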
No Model Wins Every Brief — The Best Choice Rotates
Each brief has a different winner. Ideogram leads Affective Chart. Midjourney leads Concept Incarnation. Gemini leads Paradox Portrait. Ideogram and Midjourney tie on Style Transplant. DALL·E, the most-used model, wins none.
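The rotation is easy to verify. A sketch of the per-brief leaderboard, under the same hypothetical schema as above:

```python
import pandas as pd

df = pd.read_csv("image_evaluations.csv")

# Mean score per (brief, model), sorted best-first within each brief.
leaderboard = (
    df.groupby(["brief", "model"])["score"]
      .mean()
      .reset_index()
      .sort_values(["brief", "score"], ascending=[True, False])
)

# Top model per brief: the rankings should reshuffle across briefs.
winners = leaderboard.groupby("brief").head(1)
print(winners)
```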
DALL·E's Split Personality
DALL·E scores 8.557 on Concept Incarnation — among the highest in the dataset. On Affective Chart, it scores 7.013 with a 43.5% exhibition rate. That gap — 1.5 points, nearly two standard deviations — is not noise. It's structure.
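To express that gap in standardized units, a sketch; the model and brief labels are assumed, and the pooled SD here is computed from the data rather than taken from the article:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("image_evaluations.csv")
dalle = df[df["model"] == "dalle"]

concept = dalle.loc[dalle["brief"] == "concept_incarnation", "score"]
affective = dalle.loc[dalle["brief"] == "affective_chart", "score"]

# Standardize the mean gap by the pooled standard deviation.
pooled_sd = np.sqrt((concept.var() + affective.var()) / 2)
gap = concept.mean() - affective.mean()
print(f"gap = {gap:.3f} points ({gap / pooled_sd:.2f} pooled SDs)")
```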
The Affective Chart brief requires something that DALL·E appears poorly calibrated for: making an image that feels emotionally resonant and data-informative simultaneously. It's the brief that most demands constraint adherence alongside visual originality. DALL·E's tendency toward polished, slightly literal interpretations works against it here.
"DALL·E wins Concept Incarnation and loses Affective Chart. The gap is 1.5 points. In grading terms, that's the difference between a strong pass and a borderline one."
Semantic Match Is the Real Threshold
The rubric has dimensions such as composition, emotional resonance, and technical execution. But the data reveals a hidden axis that cuts across all of them: does the evaluator recognize what you intended? The dataset encodes this as 'semantic match': whether the generated image demonstrably contains the required conceptual element.
When semantic match fails, scores collapse.
Miss the Semantic Target and Lose 3–5 Points
For Style Transplant, failing tradition_match means a mean score of 3.1 versus 8.2 when it passes, the largest single predictive gap in the entire image evaluation dataset.
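A sketch of how that gap surfaces, assuming each row carries a boolean semantic-match flag set by the evaluator (the flag's column name is an assumption):

```python
import pandas as pd

df = pd.read_csv("image_evaluations.csv")

# Mean score with vs. without semantic match, per brief,
# ranked by the size of the penalty.
penalty = (
    df.groupby(["brief", "semantic_match"])["score"].mean()
      .unstack("semantic_match")
      .assign(gap=lambda t: t[True] - t[False])
      .sort_values("gap", ascending=False)
)
print(penalty)  # Style Transplant should top the list (~8.2 vs ~3.1)
```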
The Semantic Threshold: Not 'Is It Beautiful?' But 'Did Gemini Understand It?'
This reveals something important about how the evaluation worked. Gemini was not just scoring beauty or technical craft. It was checking: does this image instantiate the concept the brief called for? A stunning image that missed the concept could score below 6. An average image that nailed it could score above 8.
This means the optimal strategy was not to make the most beautiful image — it was to make sure the image communicated its intent clearly enough for an AI evaluator to recognize it.
"The question wasn't 'is this beautiful?' It was 'does Gemini understand what you meant?'"
Five Ways to Navigate Four Image Briefs
Among 419 students with complete image portfolios, the scores cluster into five recognizable patterns.
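A sketch of how such clusters can be recovered, assuming one row per student per brief; k = 5 follows the article's finding rather than an independent model-selection step:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One column per brief, one row per student; drop incomplete portfolios
# (the article reports 419 complete ones).
portfolios = pd.read_csv("image_evaluations.csv").pivot_table(
    index="student_id", columns="brief", values="score"
).dropna()

# Standardize, then cluster score profiles into five patterns.
X = StandardScaler().fit_transform(portfolios)
portfolios["cluster"] = KMeans(
    n_clusters=5, n_init=10, random_state=0
).fit_predict(X)

print(portfolios.groupby("cluster").mean().round(2))
```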
Longer Prompts Help on Affective, Hurt on Paradox
Affective Chart rewarded longer prompts; Paradox Portrait peaked with medium-length ones. Over-specification backfired there: too many constraints narrowed the image away from the paradox's tension.
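A sketch of the underlying length analysis, assuming prompts are stored as raw text; the bin edges are illustrative, not the study's cutoffs:

```python
import pandas as pd

df = pd.read_csv("image_evaluations.csv")

# Bin prompts by word count: short / medium / long.
df["prompt_words"] = df["prompt"].str.split().str.len()
df["length_bin"] = pd.cut(
    df["prompt_words"],
    bins=[0, 25, 60, 1000],
    labels=["short", "medium", "long"],
)

# Mean score per brief and length bin.
print(df.groupby(["brief", "length_bin"], observed=True)["score"]
        .mean().unstack("length_bin").round(2))
```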
One Table to Save You a Week of Experimenting

The playbook, distilled from the per-brief winners above:

| Brief | Start with |
| --- | --- |
| Affective Chart | Ideogram |
| Concept Incarnation | Midjourney |
| Paradox Portrait | Gemini |
| Style Transplant | Ideogram or Midjourney (tied) |
Why Models Specialize — and What That Means for You
Situated Learning + Model Specialization. The model performance data is a real-world demonstration of situated learning theory. DALL·E, trained on billions of image-caption pairs, is optimized for the most common request: "generate a plausible scene." That's exactly what Concept Incarnation rewards. But Affective Chart asks something different: make an abstraction that communicates emotion without explanation. That's a harder, rarer request in training data — and DALL·E's gap (7.013 vs Ideogram's 7.637) shows it. Models aren't general-purpose intelligences. They're situated learners, shaped by the data contexts they were trained on.
The Semantic Bottleneck: a threshold concept in practice. The collapse to a 3.112 mean score for missing tradition_match in Style Transplant (vs. 8.248 when the tradition is recognized, a penalty of roughly 5.1 points) is not a grading artifact. It's the grading system correctly identifying what the brief is testing. A "Victorian painting style" with a modern iPhone in it is not a Victorian painting with a twist; it's a failure of representational understanding. The threshold isn't aesthetic; it's semantic. Cross it and everything else works. Miss it and nothing else helps.
Feynman's question: Could you explain why this image represents your concept to someone who has never seen the brief? If yes, you've cleared the semantic threshold. If no, more iterations won't help — the image doesn't know what it's supposed to mean.
For students
- Choose your model based on the brief, not on brand loyalty. Use the playbook table above.
- Before generating, test your concept: can you describe it clearly in one sentence? If you can't describe it, the model can't render it.
- "Negative prompts" hurt more than they help (negative-prompt syntax showed −0.387 score delta on Concept Incarnation). Focus your positive prompt; don't try to exclude failure modes.
For educators
- Model choice should be part of the rubric. Encourage students to explain why they chose a specific model for each brief.
- The semantic threshold is a teachable concept: show students the before/after of a concept-matched vs. concept-mismatched submission to make the gap concrete.