In 2026, 765 students generated nearly two thousand AI images and had them all evaluated by Gemini. The results didn't match the marketing. The best model for one creative brief was the worst for another. The map of task fit had been hiding in plain sight.
Everyone Had a Favorite, and Most People Used It for Everything
DALL·E dominated the dataset with 935 evaluated images, nearly 54% of the total. Most students had a single preferred model and used it across all four briefs. This is natural behavior: pick a tool, learn it well, apply it everywhere. It's efficient. It's also wrong.
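That breakdown is a one-liner to reproduce. A minimal sketch, assuming a flat results table with one row per evaluated image; the file name and column names ("model", "brief", "score") are illustrative, not the study's actual schema:

```python
import pandas as pd

# Hypothetical export: one row per evaluated image.
df = pd.read_csv("image_evaluations.csv")

# Count images per model and their share of the dataset.
counts = df["model"].value_counts()
share = df["model"].value_counts(normalize=True)
print(pd.DataFrame({"images": counts, "share": share.round(3)}))
```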
No Model Wins Every Brief — The Best Choice Rotates
Each brief has a different winner. Ideogram leads Affective Chart. Midjourney leads Concept Incarnation. Gemini leads Paradox Portrait. Ideogram and Midjourney tie on Style Transplant. DALL·E, the most-used model, wins none.
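The rotation is easy to verify. A sketch of the per-brief leaderboard, under the same hypothetical schema as above:

```python
import pandas as pd

df = pd.read_csv("image_evaluations.csv")

# Mean score per (brief, model), sorted best-first within each brief.
leaderboard = (
    df.groupby(["brief", "model"])["score"]
      .mean()
      .reset_index()
      .sort_values(["brief", "score"], ascending=[True, False])
)

# Top model per brief: the rankings should reshuffle across briefs.
winners = leaderboard.groupby("brief").head(1)
print(winners)
```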
DALL·E's Split Personality
DALL·E scores 8.557 on Concept Incarnation — among the highest in the dataset. On Affective Chart, it scores 7.013 with a 43.5% exhibition rate. That gap — 1.5 points, nearly two standard deviations — is not noise. It's structure.
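To express that gap in standardized units, a sketch; the model and brief labels are assumed, and the pooled SD here is computed from the data rather than taken from the article:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("image_evaluations.csv")
dalle = df[df["model"] == "dalle"]

concept = dalle.loc[dalle["brief"] == "concept_incarnation", "score"]
affective = dalle.loc[dalle["brief"] == "affective_chart", "score"]

# Standardize the mean gap by the pooled standard deviation.
pooled_sd = np.sqrt((concept.var() + affective.var()) / 2)
gap = concept.mean() - affective.mean()
print(f"gap = {gap:.3f} points ({gap / pooled_sd:.2f} pooled SDs)")
```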
The Affective Chart brief requires something that DALL·E appears poorly calibrated for: making an image that feels emotionally resonant and data-informative simultaneously. It's the brief that most demands constraint adherence alongside visual originality. DALL·E's tendency toward polished, slightly literal interpretations works against it here.
"DALL·E wins Concept Incarnation and loses Affective Chart. The gap is 1.5 points. In grading terms, that's the difference between a strong pass and a borderline one."
Semantic Match Is the Real Threshold
The rubric has dimensions such as composition, emotional resonance, and technical execution. But the data reveals a hidden axis that cuts across all of them: does the evaluator recognize what you intended? The dataset encodes this as 'semantic match': whether the generated image demonstrably contains the required conceptual element.
When semantic match fails, scores collapse.
Miss the Semantic Target and Lose 3–5 Points
For Style Transplant, failing tradition_match means a mean score of 3.1 versus 8.2 when it passes, the largest single predictive gap in the entire image evaluation dataset.
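A sketch of how that gap surfaces, assuming each row carries a boolean semantic-match flag set by the evaluator (the flag's column name is an assumption):

```python
import pandas as pd

df = pd.read_csv("image_evaluations.csv")

# Mean score with vs. without semantic match, per brief,
# ranked by the size of the penalty.
penalty = (
    df.groupby(["brief", "semantic_match"])["score"].mean()
      .unstack("semantic_match")
      .assign(gap=lambda t: t[True] - t[False])
      .sort_values("gap", ascending=False)
)
print(penalty)  # Style Transplant should top the list (~8.2 vs ~3.1)
```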
The Semantic Threshold: Not 'Is It Beautiful?' But 'Did Gemini Understand It?'
This reveals something important about how the evaluation worked. Gemini was not just scoring beauty or technical craft. It was checking: does this image instantiate the concept the brief called for? A stunning image that missed the concept could score below 6. An average image that nailed it could score above 8.
This means the optimal strategy was not to make the most beautiful image — it was to make sure the image communicated its intent clearly enough for an AI evaluator to recognize it.
"The question wasn't 'is this beautiful?' It was 'does Gemini understand what you meant?'"
Five Ways to Navigate Four Image Briefs
Among 419 students with complete image portfolios, the scores cluster into five recognizable patterns.
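A sketch of how such clusters can be recovered, assuming one row per student per brief; k = 5 follows the article's finding rather than an independent model-selection step:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One column per brief, one row per student; drop incomplete portfolios
# (the article reports 419 complete ones).
portfolios = pd.read_csv("image_evaluations.csv").pivot_table(
    index="student_id", columns="brief", values="score"
).dropna()

# Standardize, then cluster score profiles into five patterns.
X = StandardScaler().fit_transform(portfolios)
portfolios["cluster"] = KMeans(
    n_clusters=5, n_init=10, random_state=0
).fit_predict(X)

print(portfolios.groupby("cluster").mean().round(2))
```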
Longer Prompts Help on Affective, Hurt on Paradox
Affective Chart rewarded longer prompts; Paradox Portrait peaked with medium-length ones. Over-specification backfired there: too many constraints narrowed the image away from the paradox's tension.
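A sketch of the underlying length analysis, assuming prompts are stored as raw text; the bin edges are illustrative, not the study's cutoffs:

```python
import pandas as pd

df = pd.read_csv("image_evaluations.csv")

# Bin prompts by word count: short / medium / long.
df["prompt_words"] = df["prompt"].str.split().str.len()
df["length_bin"] = pd.cut(
    df["prompt_words"],
    bins=[0, 25, 60, 1000],
    labels=["short", "medium", "long"],
)

# Mean score per brief and length bin.
print(df.groupby(["brief", "length_bin"], observed=True)["score"]
        .mean().unstack("length_bin").round(2))
```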
One Table to Save You a Week of Experimenting

The playbook, distilled from the per-brief winners above:

| Brief | Start with |
| --- | --- |
| Affective Chart | Ideogram |
| Concept Incarnation | Midjourney |
| Paradox Portrait | Gemini |
| Style Transplant | Ideogram or Midjourney (tied) |
Why Models Specialize — and What That Means for You
Situated Learning + Model Specialization. The model performance data is a real-world demonstration of situated learning theory. DALL·E, trained on billions of image-caption pairs, is optimized for the most common request: "generate a plausible scene." That's exactly what Concept Incarnation rewards. But Affective Chart asks something different: make an abstraction that communicates emotion without explanation. That's a harder, rarer request in training data — and DALL·E's gap (7.013 vs Ideogram's 7.637) shows it. Models aren't general-purpose intelligences. They're situated learners, shaped by the data contexts they were trained on.
The Semantic Bottleneck: a threshold concept in practice. The collapse to a 3.112 mean score for missing tradition_match in Style Transplant (vs. 8.248 when the tradition is recognized, a penalty of roughly 5.1 points) is not a grading artifact. It's the grading system correctly identifying what the brief is testing. A "Victorian painting style" with a modern iPhone in it is not a Victorian painting with a twist; it's a failure of representational understanding. The threshold isn't aesthetic; it's semantic. Cross it and everything else works. Miss it and nothing else helps.
Feynman's question: Could you explain why this image represents your concept to someone who has never seen the brief? If yes, you've cleared the semantic threshold. If no, more iterations won't help — the image doesn't know what it's supposed to mean.
For students
- Choose your model based on the brief, not on brand loyalty. Use the playbook table above.
- Before generating, test your concept: can you describe it clearly in one sentence? If you can't describe it, the model can't render it.
- "Negative prompts" hurt more than they help (negative-prompt syntax showed −0.387 score delta on Concept Incarnation). Focus your positive prompt; don't try to exclude failure modes.
For educators
- Model choice should be part of the rubric. Encourage students to explain why they chose a specific model for each brief.
- The semantic threshold is a teachable concept: show students the before/after of a concept-matched vs. concept-mismatched submission to make the gap concrete.