What happens when 13,000 students sit a programming exam three times — and what the data says we should do differently
Somewhere in India, in April 2025, a student submits her Python exam for the third time. Three terms. Three sittings. Each one roughly 35 days after the last wave. She's not lazy — she's persistent. But something about the exam keeps catching her out, and no one is entirely sure what.
The OPPE — the Online Programming Practice Exam — tests Python coding across a national cohort of students. It runs in waves: a first sitting at the start of the term, a second sitting roughly five weeks later. Students who struggle can come back the next term. The system is designed with compassion. But compassion alone doesn't tell you why someone keeps failing, or what help would actually work.
Two million event logs later, we have answers. Not perfect ones — but actionable ones. The findings rewrote several assumptions about who struggles, how, and why. This is that story.
The working hypothesis going into this analysis was familiar and plausible: students are blocked by syntax. They can't write valid Python. Fix the syntax problem, fix the pass rate.
The data disagrees.
The chart below shows every student-question attempt, sorted into eight categories depending on where the attempt ended. Read it like a funnel — from the top pass rate down to the students who never wrote a line of code. The big bar in the middle is the one that matters.
Combined syntax gates — mechanical and fundamental — total just 9.5% of all attempts. Genuine logic failure — code that runs, that passes some tests, that looks correct but isn't — accounts for 26.72%. Logic failure is nearly three times more common than syntax failure.
And there's a subtler finding embedded in the waterfall: 45.49% of students who ended with non-parseable code had parseable code earlier in the same session. They broke working code while trying to fix it. That's not a knowledge failure. That's a debugging workflow failure — and it calls for a completely different intervention.
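A regression like this is mechanically easy to detect in run logs: the final snapshot fails to parse, but some earlier snapshot in the same session did. A minimal sketch, assuming each session is an ordered list of code snapshots (the session shape and helper name are illustrative, not the study's actual schema):

```python
import ast

def broke_working_code(snapshots):
    """Hypothetical helper: True if a session ended non-parseable
    but contained a parseable snapshot earlier on."""
    def parses(src):
        try:
            ast.parse(src)
            return True
        except SyntaxError:
            return False

    if not snapshots:
        return False
    return (not parses(snapshots[-1])) and any(parses(s) for s in snapshots[:-1])

session = [
    "def f(x):\n    return x + 1",   # parseable draft
    "def f(x)\n    return x + 1",    # colon lost mid-edit: no longer parses
]
print(broke_working_code(session))   # → True
```

Run over every session that ends non-parseable, this one check yields the 45.49% figure directly.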
The full-pass rate, at 46.83%, is healthier than many assume. This exam is not impossible. But the path to raising it runs through logic and debugging support, not a syntax remediation campaign.
If raw effort determined outcomes, the students who ran the most tests and spent the most time coding would have the highest pass rates. They don't. Not even close.
The chart below plots every student archetype — defined by how students actually work, not just their final score — by the median time spent versus the success rate achieved. The bubble size shows how common the archetype is.
The Steady builder finishes with an 89% success rate, spending a median of 643 seconds. The Volatile reworker — who keeps restructuring code, running test after test, and spending more than five times as long — ends with 36% success. The Thrasher, who runs a median of 36 public tests and spends over 75 minutes, achieves only 44%.
This is not about intelligence. All of these archetypes are attempting the same problems. The difference is process: the Steady builder makes small, focused changes and tests each one; the Volatile reworker makes sweeping rewrites and gets lost in a tangle of their own edits.
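The chart's inputs reduce to one aggregation per archetype: median time, success rate, and group size (the bubble). A sketch on toy records; the tuple layout and archetype names are stand-ins, not the study's data model:

```python
from statistics import median

# Hypothetical per-attempt records: (archetype, seconds_spent, passed)
attempts = [
    ("steady_builder",    600, True),
    ("steady_builder",    643, True),
    ("steady_builder",    700, False),
    ("volatile_reworker", 3300, False),
    ("volatile_reworker", 3500, True),
]

def archetype_summary(rows):
    """Per-archetype median time, success rate, and count
    (the count is the bubble size in the chart)."""
    out = {}
    for name in {r[0] for r in rows}:
        group = [r for r in rows if r[0] == name]
        out[name] = {
            "median_seconds": median(r[1] for r in group),
            "success_rate": sum(r[2] for r in group) / len(group),
            "n": len(group),
        }
    return out

summary = archetype_summary(attempts)
```

The real analysis first assigns the archetype labels from behavioural features; this sketch starts after that step.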
The most important single number in this analysis: 78.93%. That's the probability that a student who is in State 2 — code runs, passes zero tests — will still be in State 2 on their next submission.
State 2 is the largest state in the entire dataset: 47.1% of all public test-run states. Once students enter it, they almost never escape through more of the same. They need targeted feedback. They need to know what is wrong, not just that it's wrong. Getting from S2 to S3 (some tests passing) happens only 7.18% of the time in the next run; reaching full pass (S4) happens 3.56% of the time.
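Transition probabilities like the 78.93% come from counting adjacent state pairs in each student's run sequence, then normalising per source state. A hedged sketch of that computation on toy sequences:

```python
from collections import Counter, defaultdict

def transition_probs(sequences):
    """Empirical next-state probabilities from per-student state
    sequences (S1..S4): count every adjacent pair, then divide
    each count by the total transitions out of its source state."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {
        src: {dst: n / sum(c.values()) for dst, n in c.items()}
        for src, c in counts.items()
    }

# Toy sequences; the real dataset aggregates ~2M events.
runs = [
    ["S2", "S2", "S2", "S3"],
    ["S2", "S2", "S4"],
    ["S1", "S2", "S2"],
]
probs = transition_probs(runs)
```

On the toy data, `probs["S2"]["S2"]` is 4/6; on the real run logs the same calculation yields the 78.93% self-loop.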
Syntax errors, by contrast, clear quickly. Students with structural syntax errors resolve them within a single public run 50.33% of the time. Wrong-answer problems persist to the final submission 39.03% of the time. Syntax errors are painful but soluble. Logic errors are sticky.
The ten concepts tested in this exam are not equally hard. The chart below shows the fraction of students who pass all public tests for each concept — from easiest to hardest. The gap is shocking.
Arithmetic and conditionals: 60% pass rate. Data analysis and aggregation: 21% pass rate. That's a 39-point gap between the easiest concept and the hardest. The hard concepts share a profile: they require students to maintain a coherent mental model across multiple functions, handle non-trivial input formats, and produce precisely formatted output.
But here's the finding that changes the teaching prescription: the data analysis failure is not primarily a selection failure. Students who are stuck in S2 (code runs, passes zero tests) are, in most cases, already using the right constructs. They picked the right tool. They just can't make the tool do the right thing.
Across S2 failures in data analysis questions: 66.8% are application-gap failures — students who chose the right approach but couldn't execute it correctly. For pattern printing, that number reaches 93.6%. This means "teach the concept again" is the wrong intervention. The intervention needed is debugging practice — worked examples, guided code reading, small-step walkthroughs.
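One way to operationalise the application-gap vs. selection-gap split is to check whether the constructs a concept calls for actually appear in the student's (parseable) code. A sketch using Python's `ast` module; the concept-to-construct mapping here is an assumption for illustration, not the study's actual tagging:

```python
import ast

# Assumed mapping: which AST node types a concept is expected to use.
EXPECTED = {"data_analysis": {ast.For, ast.FunctionDef}}

def failure_kind(source, concept):
    """Label an S2 failure: 'application_gap' if the expected
    constructs for the concept are present in the code,
    'selection_gap' if the student never reached for them."""
    present = {type(node) for node in ast.walk(ast.parse(source))}
    return "application_gap" if EXPECTED[concept] <= present else "selection_gap"

code = """
def mean_by_group(rows):
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value   # right tool, wrong logic
    return totals
"""
print(failure_kind(code, "data_analysis"))  # → application_gap
```

The example picks the right constructs (a function, a loop, a dict) yet computes the wrong answer: exactly the profile that dominates S2 failures.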
Here is the genuinely hopeful finding: most students improve. The chart below shows the fraction of paired students — those who sat both Wave 1 and Wave 2 in the same term — whose performance moved up, stayed the same, or declined.
Across all three terms, 57–69% of students improve between Wave 1 and Wave 2 within the same term. For students who come back the next term — the determined repeaters — 80.95% improve. Students who start blocked by syntax are progressing: more than half of those who were syntax-gated in Term 1 moved to pass-like profiles in Term 2.
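The wave-over-wave split is a straightforward paired comparison. A sketch with hypothetical ordinal performance levels standing in for the real per-student scores:

```python
def improvement_split(paired_scores):
    """Fraction of paired students who improved, held steady, or
    declined between Wave 1 and Wave 2."""
    n = len(paired_scores)
    up = sum(1 for w1, w2 in paired_scores if w2 > w1)
    same = sum(1 for w1, w2 in paired_scores if w2 == w1)
    return {
        "improved": up / n,
        "same": same / n,
        "declined": (n - up - same) / n,
    }

# Toy (wave1, wave2) performance levels for five paired students.
pairs = [(1, 3), (2, 2), (3, 2), (1, 2), (2, 4)]
print(improvement_split(pairs))
```

Only students who sat both waves enter the denominator, which is why the paired cohort is defined first.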
The structural improvement is visible in the code itself, not just the scores. Between terms, students use significantly more for loops and if statements, and fewer print() calls as a substitute for proper output. The code is getting more competent.
Now the bad news.
There are 497 students who appear in all three terms — the most persistent cohort. The expectation might be that these are students stuck in S2, running code that runs but fails. That's wrong.
Of the 497 students seen across all three terms, 181 follow the same trajectory: non-parseable syntax in Term 1, non-parseable syntax in Term 2, non-parseable syntax in Term 3. Not S2. Not logic failure. Fundamental syntax. They can't yet write code that runs.
Only 3 students in this persistent cohort start from S2. The hardest-to-help group needs foundational Python instruction — typing, basic syntax, what a function looks like, how indentation works. No amount of debugging practice helps someone who can't write a parseable function.
Meanwhile, the dominant-S2 students — the ones in the death spiral — are a different group. The bad news for them: the escape rate from dominant-S2 to dominant-S3/S4 is exactly 0%, both within-term and cross-term, in this dataset. Not one student who was predominantly S2 across a term or wave moved to a predominantly S3/S4 state. They need a different kind of support — structured, conceptual help — not just more run attempts.
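Dominant-state classification, the strict lens behind that 0% figure, reduces to a mode over each student's runs plus a paired check across periods. A sketch; the record layout is hypothetical:

```python
from collections import Counter

def dominant_state(states):
    """Most frequent state across a student's runs in a period
    (ties resolve to the state seen first, an assumption here)."""
    return Counter(states).most_common(1)[0][0]

def escape_rate(cohort):
    """Among students whose first-period dominant state is S2, the
    share whose next-period dominant state is S3 or S4."""
    stuck = [s for s in cohort if dominant_state(s["t1"]) == "S2"]
    if not stuck:
        return 0.0
    escaped = [s for s in stuck if dominant_state(s["t2"]) in ("S3", "S4")]
    return len(escaped) / len(stuck)

cohort = [
    {"t1": ["S2", "S2", "S3"], "t2": ["S2", "S2"]},
    {"t1": ["S2", "S2"],       "t2": ["S2", "S3", "S2"]},
]
print(escape_rate(cohort))  # → 0.0
```

Note how strict the lens is: the second student does reach S3 at the run level, but S2 still dominates their period, so they count as not escaped. That is exactly the caveat flagged in the data appendix.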
The evaluation system has real strengths. Public-test overfitting — the phenomenon where students game the sample tests — is almost nonexistent: just 0.02% of submitters had public scores much higher than private scores. Formatting is not a meaningful barrier (0.25% of attempts). These are good signs.
But there are three design problems that limit both fairness and insight.
The findings in this data are not uniformly discouraging. Most students improve. Most archetypes can be redirected with the right intervention. But specific, concrete actions are needed across three domains: teaching, exam design, and platform infrastructure.
The system that produced these results is already better than most. It runs a national programming assessment at scale, gives students multiple chances, and tracks outcomes over time. These recommendations are not a condemnation — they're the next level. The data exists. The patterns are clear. The cost of not acting is that the 181 students in the permanent S1 → S1 → S1 spiral come back for a fourth term without anything having changed for them.
They deserve better than that.
Data sources: This analysis covers 2,057,658 event logs from 13,623 unique students across three OPPE terms (25t1, 25t2, 25t3), each with two waves separated by approximately 35 days. The 151,778 student-question attempt rows are the primary unit of analysis. Waterfall percentages are calculated over all student-question attempts. Archetype analysis covers all attempts with resolved primary labels (13 categories); "Other" residual is 6.40%. Concept mastery uses public-best all-pass rate as a proxy, consistent with the psychometric model.
Caveats: 23 of 35 question sets (Track B) have zero captured private submissions — public-best scores are used as the psychometric input for these namespaces, which slightly overstates absolute mastery (14.27% of submitters have public category > private category). The 497 all-three-term cohort is a persistent subset, not the full repeater population. S2 escape rate of 0% is based on dominant-state classification (strict lens); transitions do occur at the run level. Prerequisite graph and concept tags are heuristically generated and should be cross-checked against the actual curriculum before operational use.
Steps 4–9: Waterfall (Step 4), Process archetypes (Step 5/5a), IRT psychometrics (Step 6), Evaluation redesign synthesis (Step 7), Longitudinal analysis (Step 8), Concept knowledge modelling (Step 9).