How 38,000 students fail the same way — and what that reveals about how we teach code
Somewhere in India, on an exam night, a student finishes her Python function. She runs it against the three sample test cases in the problem statement. All three pass. She presses Submit.
Zero points.
What just happened to her has a name. In fact, it has several names. And across 38,683 final code submissions to a national programming assessment, we can tell you exactly which name applies to which student — with fingerprint-like regularity.
This is not a story about students who didn't try. It's a story about students who tried the wrong thing, in ways so systematic and repetitive that the same five mistakes surface across completely different questions, in different exam sessions, from different students, term after term.
When 4 in 10 final submissions fail, the natural assumption is diversity — a thousand different misunderstandings scattered in a thousand different directions. The data says otherwise.
We analyzed 60 distinct programming questions — from simple arithmetic to multi-function data analysis tasks — covering 90.1% of all final submissions. What emerged was not chaos. It was a taxonomy. The chart below shows every question, sorted from hardest to easiest. Notice the spread.
Some questions fail fewer than 5% of students. Others fail more than 90%. The easiest — "Check is even or divisible by 5" — is a single boolean expression. The hardest — YouTube Video Engagement Analysis — requires four interdependent helper functions, precise decimal formatting, and careful data-structure reasoning. The difference isn't just difficulty. It's the type of mistake being made.
When you name an error precisely enough, patterns start to speak. Across 1,317 distinct error patterns in 60 questions, seven archetypes account for virtually all failures. The donut below shows their relative weight — sized by total error count, not unique students, so a single submission can contribute to multiple categories.
Each archetype has its own character, its own cause, and its own remedy. The next two acts examine the two most instructive in close detail.
The task sounds straightforward: given a three-word sentence and a permutation tuple, return the words in the new order. A function named shuffle_sentence(sentence, order).
The sample inputs in the problem statement used three specific permutations: (0,2,1), (2,1,0), and (1,0,2). Those three permutations share a remarkable property — they are all self-inverse. Apply the permutation, apply it again, you get the original order back. This means a student could write code implementing the wrong algorithm — the inverse permutation instead of the forward permutation — and still pass every single sample test.
That's exactly what happened. 34% of failing shuffle submissions hard-coded behavior for the known permutations or implemented the inverse permutation. A typical failing submission looks reasonable at first glance — it uses the order parameter, builds a result list, formats the output. But it stores word i at position order[i] instead of placing words[order[i]] at position i. It's the inverse permutation. For the public tests — all self-inverse — the result is identical. For private tests with cyclic permutations like (2,0,1): wrong answer. Score: 33 out of 100.
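The inverse bug and the intended mapping can be reconstructed side by side. This is a sketch with our own function and variable names, not any student's actual code:

```python
def shuffle_buggy(sentence, order):
    # The inverse-permutation bug: sends word i TO position order[i],
    # instead of letting position i RECEIVE word order[i].
    words = sentence.split()
    result = [None] * len(words)
    for i in range(len(words)):
        result[order[i]] = words[i]
    return " ".join(result)

def shuffle_correct(sentence, order):
    # Intended behavior: position i receives word order[i].
    words = sentence.split()
    return " ".join(words[i] for i in order)

# Self-inverse sample permutation: the two versions agree,
# so every public test passes.
print(shuffle_buggy("red green blue", (0, 2, 1)))    # red blue green
print(shuffle_correct("red green blue", (0, 2, 1)))  # red blue green

# Cyclic private-test permutation: the bug surfaces.
print(shuffle_buggy("red green blue", (2, 0, 1)))    # green blue red
print(shuffle_correct("red green blue", (2, 0, 1)))  # blue red green
```

On the three self-inverse samples the two functions are indistinguishable; only a cyclic permutation separates them.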
The simpler variant of this mistake is the explicit if-elif chain: if order == (0, 2, 1): return ... for each known permutation. Memorizing the answer sheet rather than solving the problem. Same mechanism, more visible.
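A hypothetical reconstruction of that variant: it passes all three samples and silently returns None for anything else.

```python
def shuffle_hardcoded(sentence, order):
    # Memorizes the three sample permutations instead of solving generally.
    a, b, c = sentence.split()
    if order == (0, 2, 1):
        return f"{a} {c} {b}"
    elif order == (2, 1, 0):
        return f"{c} {b} {a}"
    elif order == (1, 0, 2):
        return f"{b} {a} {c}"
    # Any unseen permutation falls through and returns None.

print(shuffle_hardcoded("red green blue", (0, 2, 1)))  # red blue green
print(shuffle_hardcoded("red green blue", (2, 0, 1)))  # None
```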
Why does this happen? Because the test feedback loop teaches exactly the wrong lesson. Every sample test that passes reinforces the belief that the approach is correct. The private tests — which probe the general case — are invisible until submission. By then, it's too late.
A pangram is a sentence containing every letter of the alphabet. "The quick brown fox jumps over the lazy dog" is the canonical example. The task: write is_pangram(text) — return True if the sentence is a pangram, False otherwise.
Of 1,331 students who submitted a final answer, exactly 665 failed. A near-perfect 50–50. Within those 665 failures, one pattern so dominates the others that it rewrites the story of the question entirely.
The bug: returning True or False inside the loop that checks each letter. The loop is supposed to check all 26 letters before concluding anything; the return statement exits the function after checking exactly one. The code is valid Python. It runs without errors. It even gives the right answer sometimes: if the first letter it tests is present, it returns True, and if not, False. It just doesn't solve the stated problem.
Here’s the bug in action — a function that looks like it's checking all 26 letters, but actually returns after checking just the first one.
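A minimal reconstruction of one common shape of the bug (hypothetical code, not any specific student's submission). The loop appears to scan all 26 letters, but the if/else guarantees a return on the very first iteration, so the verdict rests on the letter 'a' alone:

```python
def is_pangram(text):
    text = text.lower()
    for ch in "abcdefghijklmnopqrstuvwxyz":
        if ch in text:
            return True   # concludes after finding one letter
        else:
            return False  # concludes after missing one letter

print(is_pangram("aardvark"))  # True  (not a pangram, but 'a' is present)
print(is_pangram("bcdfg"))     # False
```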
This is The Quitter: correct structure, premature conclusion. The intuition behind it is locally reasonable: "If the letter is here, it's a pangram — return True. If not — return False." That reasoning is right for one letter in isolation. A pangram requires all 26, which means you must finish the loop before concluding anything.
The pattern appears in at least 8 other question clusters under slightly different labels: "returns inside loop before completing full check," "stops after first match," "decides after first iteration." It is the same cognitive model error, wearing different clothes across different problems.
The fix is a single structural change: keep the early return False for the case where a letter is genuinely missing, and move the return True outside the loop so it runs only after all 26 checks have passed. But students don't see that, because their mental model says "check → decide" rather than "check all → decide once."
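The corrected structure, for contrast (a sketch, not the official solution): the early exit is reserved for a definite negative, and the positive verdict waits until the loop has finished.

```python
def is_pangram(text):
    letters = set(text.lower())
    for ch in "abcdefghijklmnopqrstuvwxyz":
        if ch not in letters:
            return False  # one missing letter settles it immediately
    return True           # reachable only after all 26 letters passed

print(is_pangram("The quick brown fox jumps over the lazy dog"))  # True
print(is_pangram("hello world"))                                  # False
```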
The scatter plot below maps each question by its popularity (students attempted) against its failure rate. Most questions cluster in a band between 20% and 60% failure. A handful sit in the upper right — where 80%, 90%, even 95% of students who submitted a final answer still failed.
The questions with the highest failure rates share a profile: multiple helper functions, non-trivial data formats, and output that requires precise formatting. They test whether students can maintain a coherent mental model across a complex system — not just implement a single algorithm.
YouTube Video Engagement Analysis is the hardest: 514 of 542 students (95%) failed. The single most common failure — 103 students — was not rounding to 2 decimal places. One hundred and three students wrote structurally correct code — right formula, right data access — and failed everything because they skipped round(rate, 2). That's not a conceptual failure. It's a spec-reading failure.
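The failure mode is easy to reproduce. In the sketch below the helper name and the numbers are ours, not the question's; the point is only that an unrounded float fails any exact-match comparison against a value rounded to 2 decimal places.

```python
def engagement_rate(likes, views):
    # The spec's final step: round the percentage to 2 decimal places.
    return round(likes / views * 100, 2)

likes, views = 1234, 56789
print(likes / views * 100)            # ~2.172956, fails an exact-match grader
print(engagement_rate(likes, views))  # 2.17
```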
Another major failure mode: stub bodies (a bare return or pass) in required helpers, from students who ran out of time or couldn't figure out one part and submitted incomplete work. A bug in one helper propagates through all downstream functions.
The contrast with the easiest questions is instructive. "Check is even or divisible by 5" failed just 11% of students. "Compute Electricity Bill" failed 15%. These questions have single functions, numerical inputs, no formatting requirements. They test one concept at a time. Students can verify them easily against the samples.
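For scale, the entire body of the easiest question plausibly fits in one line (the function name is our guess, not the question's):

```python
def check_number(n):
    # Even, or divisible by 5: a single boolean expression.
    return n % 2 == 0 or n % 5 == 0

print(check_number(4))   # True
print(check_number(15))  # True
print(check_number(7))   # False
```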
The patterns in this data are not just interesting. They are actionable. That is what the fingerprints suggest.
There is something hopeful embedded in all of this. The errors are systematic, which means they are teachable. The pangram bug isn't a mystery — it's a mental model mismatch about when to exit a loop. Once named, it can be addressed directly. The Mimic isn't dishonesty — it's a misaligned incentive created by visible sample tests and invisible private ones. Once the system changes, the behavior changes.
The data is optimistic. Not because the failure rate is low — it isn't. But because the failures are comprehensible. They have names. They have causes. And causes, unlike chaos, can be fixed.
Methodology: This analysis covers 60 of 163 question clusters from the OPPE 2025 programming assessment, representing 90.1% of all final submissions (38,683 of 42,918 total). Error patterns were generated by LLM-assisted analysis of actual student code, then validated and clustered by pattern type. The 1,317 named patterns represent distinct failure modes across these questions, with 1,441 representative code examples extracted from real student submissions.
A "final submission" is a student's last evaluated private-test submission for a question. Students who only ran public tests are excluded from the denominator. "Full pass" means 100% of private test cases passed. Failure rate is non_full / final_submitters per question cluster. Archetype classification uses keyword matching on pattern names; ambiguous patterns default to "Logic Error."
Caveat: This analysis spans academic year 2025 across multiple terms and waves. Error patterns across cohorts are remarkably consistent — which increases confidence — but temporal and selection effects cannot be fully ruled out. The 103 questions not yet analyzed may show different patterns.