How 38,000 students fail the same way — and what that reveals about how we teach code
Somewhere in India, on an exam night, a student finishes her Python function. She runs it against the three sample test cases in the problem statement. All three pass. She presses Submit.
Zero points.
What just happened to her has a name. In fact, it has several names. And across 38,683 final code submissions to a national programming assessment, we can tell you exactly which name applies to which student — with fingerprint-like regularity.
This is not a story about students who didn't try. It's a story about students who tried the wrong thing, in ways so systematic and repetitive that the same five mistakes surface across completely different questions, in different exam sessions, from different students, term after term.
When 4 in 10 final submissions fail, the natural assumption is diversity — a thousand different misunderstandings scattered in a thousand different directions. The data says otherwise.
We analyzed 60 distinct programming questions — from simple arithmetic to multi-function data analysis tasks — covering 90.1% of all final submissions. What emerged was not chaos. It was a taxonomy. The chart below shows every question, sorted from hardest to easiest. Notice the spread.
Some questions fail fewer than 5% of students. Others fail more than 90%. The easiest — "Check is even or divisible by 5" — is a single boolean expression. The hardest — YouTube Video Engagement Analysis — requires four interdependent helper functions, precise decimal formatting, and careful data-structure reasoning. The difference isn't just difficulty. It's the type of mistake being made.
When you name an error precisely enough, patterns start to speak. Across 1,317 distinct error patterns in 60 questions, seven archetypes account for virtually all failures. The donut below shows their relative weight — sized by total error count, not unique students, so a single submission can contribute to multiple categories.
Each archetype has its own character, its own cause, and its own remedy. The next two acts examine the two most instructive in close detail.
The task sounds straightforward: given a three-word sentence and a permutation tuple, return the words in the new order. A function named shuffle_sentence(sentence, order).
The sample inputs in the problem statement used three specific permutations: (0,2,1), (2,1,0), and (1,0,2). Those three permutations share a remarkable property — they are all self-inverse. Apply the permutation, apply it again, you get the original order back. This means a student could write code implementing the wrong algorithm — the inverse permutation instead of the forward permutation — and still pass every single sample test.
That's exactly what happened. 34% of failing shuffle submissions hard-coded behavior for the known permutations or implemented the inverse permutation. A typical failing submission looks reasonable at first glance — it uses the order parameter, builds a result list, formats the output. But it stores word i at position order[i] instead of placing words[order[i]] at position i. It's the inverse permutation. For the public tests — all self-inverse — the result is identical. For private tests with cyclic permutations like (2,0,1): wrong answer. Score: 33 out of 100.
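The inverse bug and the intended mapping can be reconstructed side by side. This is a sketch with our own function and variable names, not any student's actual code:

```python
def shuffle_buggy(sentence, order):
    # The inverse-permutation bug: sends word i TO position order[i],
    # instead of letting position i RECEIVE word order[i].
    words = sentence.split()
    result = [None] * len(words)
    for i in range(len(words)):
        result[order[i]] = words[i]
    return " ".join(result)

def shuffle_correct(sentence, order):
    # Intended behavior: position i receives word order[i].
    words = sentence.split()
    return " ".join(words[i] for i in order)

# Self-inverse sample permutation: the two versions agree,
# so every public test passes.
print(shuffle_buggy("red green blue", (0, 2, 1)))    # red blue green
print(shuffle_correct("red green blue", (0, 2, 1)))  # red blue green

# Cyclic private-test permutation: the bug surfaces.
print(shuffle_buggy("red green blue", (2, 0, 1)))    # green blue red
print(shuffle_correct("red green blue", (2, 0, 1)))  # blue red green
```

On the three self-inverse samples the two functions are indistinguishable; only a cyclic permutation separates them.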
The simpler variant of this mistake is the explicit if-elif chain: if order == (0, 2, 1): return ... for each known permutation. Memorizing the answer sheet rather than solving the problem. Same mechanism, more visible.
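A hypothetical reconstruction of that variant: it passes all three samples and silently returns None for anything else.

```python
def shuffle_hardcoded(sentence, order):
    # Memorizes the three sample permutations instead of solving generally.
    a, b, c = sentence.split()
    if order == (0, 2, 1):
        return f"{a} {c} {b}"
    elif order == (2, 1, 0):
        return f"{c} {b} {a}"
    elif order == (1, 0, 2):
        return f"{b} {a} {c}"
    # Any unseen permutation falls through and returns None.

print(shuffle_hardcoded("red green blue", (0, 2, 1)))  # red blue green
print(shuffle_hardcoded("red green blue", (2, 0, 1)))  # None
```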
Why does this happen? Because the test feedback loop teaches exactly the wrong lesson. Every sample test that passes reinforces the belief that the approach is correct. The private tests — which probe the general case — are invisible until submission. By then, it's too late.
A pangram is a sentence containing every letter of the alphabet. "The quick brown fox jumps over the lazy dog" is the canonical example. The task: write is_pangram(text) — return True if the sentence is a pangram, False otherwise.
Of 1,331 students who submitted a final answer, exactly 665 failed. A near-perfect 50–50. Within those 665 failures, one pattern so dominates the others that it rewrites the story of the question entirely.
The bug: returning True or False inside the loop that checks each letter. The loop is supposed to check all 26 letters before concluding anything; the return statement exits the function after checking exactly one. The code is valid Python. It runs without errors. It even gives the right answer sometimes: if the first letter it tests is present, it returns True, and if not, False. It just doesn't solve the stated problem.
Here’s the bug in action — a function that looks like it's checking all 26 letters, but actually returns after checking just the first one.
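A minimal reconstruction of one common shape of the bug (hypothetical code, not any specific student's submission). The loop appears to scan all 26 letters, but the if/else guarantees a return on the very first iteration, so the verdict rests on the letter 'a' alone:

```python
def is_pangram(text):
    text = text.lower()
    for ch in "abcdefghijklmnopqrstuvwxyz":
        if ch in text:
            return True   # concludes after finding one letter
        else:
            return False  # concludes after missing one letter

print(is_pangram("aardvark"))  # True  (not a pangram, but 'a' is present)
print(is_pangram("bcdfg"))     # False
```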
This is The Quitter: correct structure, premature conclusion. The intuition behind it is locally reasonable: "If the letter is here, it's a pangram — return True. If not — return False." That reasoning is right for one letter in isolation. A pangram requires all 26, which means you must finish the loop before concluding anything.
The pattern appears in at least 8 other question clusters under slightly different labels: "returns inside loop before completing full check," "stops after first match," "decides after first iteration." It is the same cognitive model error, wearing different clothes across different problems.
The fix is a single structural change: keep the early return False for the case where a letter is genuinely missing, and move the return True outside the loop so it runs only after all 26 checks have passed. But students don't see that, because their mental model says "check → decide" rather than "check all → decide once."
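The corrected structure, for contrast (a sketch, not the official solution): the early exit is reserved for a definite negative, and the positive verdict waits until the loop has finished.

```python
def is_pangram(text):
    letters = set(text.lower())
    for ch in "abcdefghijklmnopqrstuvwxyz":
        if ch not in letters:
            return False  # one missing letter settles it immediately
    return True           # reachable only after all 26 letters passed

print(is_pangram("The quick brown fox jumps over the lazy dog"))  # True
print(is_pangram("hello world"))                                  # False
```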
The scatter plot below maps each question by its popularity (students attempted) against its failure rate. Most questions cluster in a band between 20% and 60% failure. A handful sit in the upper right — where 80%, 90%, even 95% of students who submitted a final answer still failed.
The questions with the highest failure rates share a profile: multiple helper functions, non-trivial data formats, and output that requires precise formatting. They test whether students can maintain a coherent mental model across a complex system — not just implement a single algorithm.
YouTube Video Engagement Analysis is the hardest: 514 of 542 students (95%) failed. The single most common failure — 103 students — was not rounding to 2 decimal places. One hundred and three students wrote structurally correct code — right formula, right data access — and failed everything because they skipped round(rate, 2). That's not a conceptual failure. It's a spec-reading failure.
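The failure mode is easy to reproduce. In the sketch below the helper name and the numbers are ours, not the question's; the point is only that an unrounded float fails any exact-match comparison against a value rounded to 2 decimal places.

```python
def engagement_rate(likes, views):
    # The spec's final step: round the percentage to 2 decimal places.
    return round(likes / views * 100, 2)

likes, views = 1234, 56789
print(likes / views * 100)            # ~2.172956, fails an exact-match grader
print(engagement_rate(likes, views))  # 2.17
```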
Another major failure mode: stub bodies (a bare return or pass) in required helpers, from students who ran out of time or couldn't figure out one part and submitted incomplete work. A bug in one helper propagates through all downstream functions.
The contrast with the easiest questions is instructive. "Check is even or divisible by 5" failed just 11% of students. "Compute Electricity Bill" failed 15%. These questions have single functions, numerical inputs, no formatting requirements. They test one concept at a time. Students can verify them easily against the samples.
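For scale, the entire body of the easiest question plausibly fits in one line (the function name is our guess, not the question's):

```python
def check_number(n):
    # Even, or divisible by 5: a single boolean expression.
    return n % 2 == 0 or n % 5 == 0

print(check_number(4))   # True
print(check_number(15))  # True
print(check_number(7))   # False
```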
The patterns in this data are not just interesting. They are actionable. That is what the fingerprints suggest.
There is something hopeful embedded in all of this. The errors are systematic, which means they are teachable. The pangram bug isn't a mystery — it's a mental model mismatch about when to exit a loop. Once named, it can be addressed directly. The Mimic isn't dishonesty — it's a misaligned incentive created by visible sample tests and invisible private ones. Once the system changes, the behavior changes.
The data is optimistic. Not because the failure rate is low — it isn't. But because the failures are comprehensible. They have names. They have causes. And causes, unlike chaos, can be fixed.
Methodology: This analysis covers 60 of 163 question clusters from the OPPE 2025 programming assessment, representing 90.1% of all final submissions (38,683 of 42,918 total). Error patterns were generated by LLM-assisted analysis of actual student code, then validated and clustered by pattern type. The 1,317 named patterns represent distinct failure modes across these questions, with 1,441 representative code examples extracted from real student submissions.
A "final submission" is a student's last evaluated private-test submission for a question. Students who only ran public tests are excluded from the denominator. "Full pass" means 100% of private test cases passed. Failure rate is non_full / final_submitters per question cluster. Archetype classification uses keyword matching on pattern names; ambiguous patterns default to "Logic Error."
Caveat: This analysis spans academic year 2025 across multiple terms and waves. Error patterns across cohorts are remarkably consistent — which increases confidence — but temporal and selection effects cannot be fully ruled out. The 103 questions not yet analyzed may show different patterns.