Quick Fixes · OPPE 2025 · Python Programming Assessment

Fix the Test,
Not the Student

How 8 small changes to programming exam test cases could rescue 1 in 4 failing students — without rewriting a single question

16,402 Students who didn't pass
~4,118 Potentially addressable
8 Targeted fixes
~1 day To implement P0 fixes

Imagine you're a student. It's exam night. You've spent three hours on this Python function. You test it against every example in the problem statement — three cases, all pass. You hit Submit.

Zero points.

Now imagine that same scenario plays out across 1,348 students — all of whom wrote code that passed every visible test, all of whom got zero on tests they'd never seen. Not because their logic was wrong. Because the visible tests were designed in a way that let a bad strategy look like a good one.

That's not a student problem. That's a test-design problem. And it's fixable — sometimes in an afternoon.


Act I

The 16,000 Who Didn't Quite Make It

In the 2025 OPPE Python assessment — three terms, a national cohort — 16,402 final submissions came back without a perfect score. Of those, 83.9% were analyzed across 60 question clusters covering most of the exam.

What the analysis found wasn't random noise. Failures clustered into recognizable patterns: hardcoding visible examples, stopping loops too early, misreading file conventions, formatting output slightly wrong. Pattern after pattern, question after question, term after term.

The key insight was quieter than it sounds. Many of these patterns don't reveal a student who failed to understand the concept. They reveal a student who understood the concept but was trained by the public tests to do the wrong thing. The test taught a bad strategy, then penalized the student for learning it.
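A hypothetical toy question makes the mechanism concrete. Everything below is invented for illustration (the "sum of digits" task, the test values, the function names); it shows how a small visible suite can reward a lookup table as fully as real logic:

```python
# Hypothetical illustration of how visible tests can teach hardcoding.
# Suppose the three visible examples for a "sum of digits" question are:
visible_tests = [(123, 6), (45, 9), (7, 7)]

# A student who pattern-matches the examples can score 3/3 with a lookup
# table and no digit logic at all:
def digit_sum_hardcoded(n):
    return {123: 6, 45: 9, 7: 7}[n]

# The intended general solution:
def digit_sum(n):
    return sum(int(d) for d in str(abs(n)))

# Both pass every visible test...
assert all(digit_sum_hardcoded(n) == want for n, want in visible_tests)
assert all(digit_sum(n) == want for n, want in visible_tests)

# ...but the first unseen input separates them: digit_sum(908) == 17,
# while digit_sum_hardcoded(908) raises KeyError.
assert digit_sum(908) == 17
```

Until the hidden suite runs, the two submissions are indistinguishable, which is exactly how a bad strategy comes to look like a good one.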

Every fix, ranked by how many students it could rescue
Bar width = estimated non-full rows affected (upper bound). Click any row for details and code examples.
P0 — deploy today
P1 — this sprint
P2 — release guardrail

Seven of these eight fixes require nothing more than a few extra test cases or a sentence in the problem statement. No rubric changes. No concept changes. Just closing the gap between what the visible tests reward and what the hidden tests actually measure.

"The public tests didn't just fail to catch bad strategies — they actively taught them."

The eighth fix — the variant-equivalence guardrail — is different. It's a release process check, not a test case. But it may be the most important of all. In one question alone, 65% of an entire variant cohort effectively solved a different problem than the other variant. That's not a student failure. That's a system failure.
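One plausible shape for such a guardrail is a pre-release script that runs each variant's reference solution against its own private suite and against the sibling variant's suite, flagging asymmetries for human review. Every name below (`run_suite`, `check_variant_equivalence`, the `variants` mapping) is invented for illustration, not the actual evaluator API:

```python
# Sketch of a pre-release variant-equivalence guardrail (hypothetical API).

def run_suite(solution, tests):
    """Fraction of (args, expected) pairs the solution gets right."""
    passed = sum(1 for args, expected in tests if solution(*args) == expected)
    return passed / len(tests)

def check_variant_equivalence(variants):
    """variants maps a variant name to (reference_solution, test_suite).

    Each reference must pass its own suite; cross-running each reference
    against the sibling suites surfaces spec drift. If variant A's
    reference scores poorly on variant B's tests (or vice versa), the two
    prompts may be describing different problems."""
    cross_scores = {}
    for name, (ref, tests) in variants.items():
        own = run_suite(ref, tests)
        assert own == 1.0, f"{name}: reference fails its own suite ({own:.0%})"
        for other, (_, other_tests) in variants.items():
            if other != name:
                cross_scores[(name, other)] = run_suite(ref, other_tests)
    return cross_scores  # large asymmetries -> human review before release
```

Run against the real reference solutions and private suites, a low or lopsided cross-score is exactly the "65% of a cohort solved a different problem" signal, caught before any student sees the exam.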


Act II

The Four Patterns That Matter Most

Four families of failure dominate the P0 list — the ones that can be fixed today and will reach the most students. Each has a name, a mechanism, and a fix that takes hours, not weeks.


Act III

What If You Did All Eight?

Select any combination of fixes below and see their cumulative potential impact. Note that the estimates are upper bounds and the patterns may overlap; real gain is likely 60–80% of the figure shown.
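The overlap caveat is just set arithmetic. In this toy sketch (fix names and student IDs invented), each fix is a set of affected students; the naive sum double-counts anyone who appears under several fixes, while the union gives the true reach:

```python
# Toy sketch of why summed per-fix impacts are an upper bound: the same
# student can appear under several fixes. Fix names and IDs are invented.
fix_reach = {
    "public-test hardening": {1, 2, 3, 4, 5},
    "loop-bound clarification": {4, 5, 6, 7},
    "output-format spec": {2, 7, 8},
}

naive_total = sum(len(ids) for ids in fix_reach.values())  # 12, double-counts overlap
true_reach = len(set().union(*fix_reach.values()))         # 8 distinct students

# The discount is the ratio of the union to the naive sum -- here ~67%,
# squarely in the 60-80% band quoted above.
overlap_discount = true_reach / naive_total
assert 0.6 <= overlap_discount <= 0.8
```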

Impact Calculator

Combined, the eight fixes could address up to ~25% of all non-full final submissions. That's roughly 1 in 4 students who failed — not because they couldn't code, but because the exam's own test suite left a gap between what they saw and what was measured.

The research is unambiguous here. Hattie and Timperley's landmark 2007 review of feedback research found that specific, immediate feedback is among the highest-impact interventions in education. When a student fails a hidden test they've never seen, they get no actionable feedback — just a score. These fixes turn invisible private failures into visible public signals. Students can learn from them. That's not a test-design detail. That's a pedagogy decision.


Act IV

The Playbook: Fastest First

Ordered by speed and reach — the same-day fixes first, then the sprint work, then the release pipeline addition.


What we'd want to confirm — Affected-row counts are upper bounds: patterns within a fix may overlap, and the same student may be counted under multiple fixes, so real-world gain is likely 60–80% of the headline estimate. The analysis covers 60 of 163 question clusters (83.9% of submissions); the remaining ~16% of failing submissions are not yet characterized. And adding public tests changes what students practice, not just what they see: a positive effect, but one worth monitoring for unintended coaching in future terms.

Data — errors.json (6.7 MB; 60 clusters, 16,402 non-full final submissions) + ERRORS.md (cluster index) + individual cluster reports.

Prioritization — Impact = affected non-full rows linked to a fixable pattern family (upper bound; patterns may overlap). Effort = faculty/content-team hours to change the prompt/test JSON and evaluator behavior.

Evidence base — Hattie & Timperley (2007) on the power of feedback; Hao et al. (2019) on immediate feedback in programming education; Alkafaween et al. (2024) on LLM-generated test suites; Sweller & Cooper (1985) on worked examples.

Cluster reports — Individual files (ERRORS-cluster-c*.md) contain exhaustive pattern inventories with student code samples and private-case breakdowns.