OPPE 2025 · Python Programming Assessment · Research Report

Inside the Exam

What happens when 13,000 students sit a programming exam three times — and what the data says we should do differently

13,623 unique students
46.83% full pass rate
26.72% logic failures
80.95% of repeaters improve

Somewhere in India, in April 2025, a student submits her Python exam for the third time. Three terms. Three sittings. Each one roughly 35 days after the last wave. She's not lazy — she's persistent. But something about the exam keeps catching her out, and no one is entirely sure what.

The OPPE — the Online Programming Practice Exam — tests Python coding across a national cohort of students. It runs in waves: a first sitting at the start of the term, a second sitting roughly five weeks later. Students who struggle can come back the next term. The system is designed with compassion. But compassion alone doesn't tell you why someone keeps failing, or what help would actually work.

Two million event logs later, we have answers. Not perfect ones — but actionable ones. The findings rewrote several assumptions about who struggles, how, and why. This is that story.


Act I

The Wrong Diagnosis

The working hypothesis going into this analysis was familiar and plausible: students are blocked by syntax. They can't write valid Python. Fix the syntax problem, fix the pass rate.

The data disagrees.

The chart below shows every student-question attempt, sorted into eight categories depending on where the attempt ended. Read it like a funnel — from the top pass rate down to the students who never wrote a line of code. The big bar in the middle is the one that matters.

Where 100% of student-question attempts end — the gating waterfall
Syntax gates (combined) account for less than 10% of all attempts.

Combined syntax gates — mechanical and fundamental — total just 9.5% of all attempts. Genuine logic failure — code that runs, that passes some tests, that looks correct but isn't — accounts for 26.72%. Logic failure is nearly three times more common than syntax failure.

"The biggest problem is not students failing to write Python syntax. The biggest problem is students writing runnable Python that gives the wrong answer."

And there's a subtler finding embedded in the waterfall: 45.49% of students who ended with non-parseable code had parseable code earlier in the same session. They broke working code while trying to fix it. That's not a knowledge failure. That's a debugging workflow failure — and it calls for a completely different intervention.

The full-pass rate, at 46.83%, is healthier than many assume. This exam is not impossible. But the path to raising it runs through logic and debugging support, not a syntax remediation campaign.
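The waterfall's categories can be sketched as a simple decision cascade. This is an illustrative simplification, not the pipeline's actual eight-category rule, and the field names are invented for the example:

```python
# Hypothetical sketch of the gating waterfall: bucket one student-question
# attempt by where it ended. Field names are illustrative, not the real
# OPPE schema, and the categories are a simplified subset of the report's.
def gate(attempt):
    """Return a waterfall category for one attempt dict."""
    if not attempt["code"].strip():
        return "no_code"                  # never wrote a line
    if not attempt["parses"]:
        return "syntax_gate"              # ended with non-parseable code
    if attempt["runtime_error"]:
        return "runtime_gate"             # parses, but crashes
    if attempt["public_passed"] == 0:
        return "logic_gate_zero"          # runs, passes nothing (S2-like)
    if attempt["public_passed"] < attempt["public_total"]:
        return "logic_gate_partial"       # runs, partially correct
    return "full_pass"

example = {"code": "def f(x): return x * 2", "parses": True,
           "runtime_error": False, "public_passed": 3, "public_total": 5}
print(gate(example))  # → logic_gate_partial
```

The ordering matters: each gate is only reached by attempts that cleared every gate above it, which is what makes the category shares sum to 100%.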


Act II

The Student Who Tries the Hardest Gets the Worst Grade

If raw effort determined outcomes, the students who ran the most tests and spent the most time coding would have the highest pass rates. They don't. Not even close.

The chart below plots every student archetype — defined by how students actually work, not just their final score — by the median time spent versus the success rate achieved. The bubble size shows how common the archetype is.

Effort vs. outcomes by student archetype
Each bubble = one archetype. Size = share of all attempts. X = median active time. Y = success rate.

The Steady builder finishes with an 89% success rate, spending a median of 643 seconds. The Volatile reworker — who keeps restructuring code, running test after test, spending more than five times as long — ends with 36% success. The Thrasher, who runs a median of 36 public tests and spends over 75 minutes, achieves only 44%.

This is not about intelligence. Both archetypes are attempting the same problems. The difference is process: the Steady builder makes small, focused changes and tests them. The Volatile reworker makes sweeping rewrites and gets lost in a tangle of their own edits.

The Regression Archetype: 7.66% of attempts belong to students who end with code that is worse than what they started with. Their success rate: 5.65%. This is the extreme case of the Volatile reworker — students who had something passable, kept editing, and destroyed it. Nearly half (45.49%) of all sessions that end in syntax errors passed through a parseable state first.

The S2 Death Spiral

The most important single number in this analysis: 78.93%. That's the probability that a student who is in State 2 — code runs, passes zero tests — will still be in State 2 on their next submission.

State 2 is the largest state in the entire dataset: 47.1% of all public test-run states. Once students enter it, they almost never escape through more of the same. They need targeted feedback. They need to know what is wrong, not just that it's wrong. Getting from S2 to S3 (some tests passing) happens only 7.18% of the time in the next run; reaching full pass (S4) happens 3.56% of the time.

State 2 is nearly inescapable — the transition probabilities
S2: code runs but passes 0 public tests (47.1% of all public runs). The self-loop is 78.93%.

Syntax errors, by contrast, clear quickly. Students with structural syntax errors resolve them within a single public run 50.33% of the time. Wrong-answer problems persist to the final submission 39.03% of the time. Syntax errors are painful but soluble. Logic errors are sticky.
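The transition probabilities behind these figures come from counting consecutive state pairs in each student's run sequence. A minimal sketch, using toy sequences rather than the OPPE logs:

```python
from collections import Counter, defaultdict

# Illustrative sketch: estimating the S2 self-loop probability from
# per-student sequences of public-run states (S0..S4). The sequences
# below are toy data, not the actual dataset.
def transition_probs(sequences):
    """Return {state: {next_state: probability}} from run sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):   # consecutive run pairs
            counts[a][b] += 1
    return {s: {t: n / sum(c.values()) for t, n in c.items()}
            for s, c in counts.items()}

runs = [
    ["S2", "S2", "S2", "S3"],   # slow escape to partial pass
    ["S2", "S2", "S2", "S2"],   # the self-loop
    ["S1", "S2", "S2", "S4"],   # syntax fix, then full pass
]
probs = transition_probs(runs)
print(round(probs["S2"]["S2"], 3))  # → 0.75
```

On the real data the same computation yields the 78.93% self-loop the report describes.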


Act III

What Students Don't Understand — and Why It's Not What You'd Expect

The ten concepts tested in this exam are not equally hard. The chart below shows the fraction of students who pass all public tests for each concept — from easiest to hardest. The gap is shocking.

Concept mastery (% passing all public tests) — sorted easiest to hardest
A 39-point gap separates arithmetic (easiest) from data analysis (hardest).

Arithmetic and conditionals: 60% pass rate. Data analysis and aggregation: 21% pass rate. That's a 39-point gap between the easiest concept and the hardest. The hard concepts share a profile: they require students to maintain a coherent mental model across multiple functions, handle non-trivial input formats, and produce precisely formatted output.

But here's the finding that changes the teaching prescription: the data analysis failure is not primarily a selection failure. Students who are stuck in S2 (code runs, passes zero tests) are, in most cases, already using the right constructs. They picked the right tool. They just can't make the tool do the right thing.

Across S2 failures in data analysis questions: 66.8% are application-gap failures — students who chose the right approach but couldn't execute it correctly. For pattern printing, that number reaches 93.6%. This means "teach the concept again" is the wrong intervention. The intervention needed is debugging practice — worked examples, guided code reading, small-step walkthroughs.

The Loop Paradox: Loops are used in 48.7% of all attempts — students know when they need a loop. But among those students, the pass rate is only 45.7%. They know the tool exists. They don't know how to use it without making it exit too early, iterate over the wrong thing, or accumulate the wrong result. High usage, low mastery = practice problem, not exposure problem.
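Detecting construct usage like this is straightforward with Python's own parser. A hedged sketch of the kind of check behind the loop-usage figure — the classification rule here is ours, not necessarily the pipeline's:

```python
import ast

# Illustrative sketch: does a submission use a loop construct? Parsing
# rather than executing means even broken-at-runtime code is counted,
# while non-parseable code is not.
def uses_loop(source):
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False     # non-parseable code can't be credited with a loop
    return any(isinstance(node, (ast.For, ast.While))
               for node in ast.walk(tree))

print(uses_loop("total = 0\nfor x in data:\n    total += x"))  # → True
print(uses_loop("total = sum(data)"))                          # → False
```

The same `ast.walk` pattern extends to counting `if` statements or `print()` calls, the constructs tracked in the cross-term comparison.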

Act IV

The Good News, the Bad News, and the 181

Here is the genuinely hopeful finding: most students improve. The chart below shows the fraction of paired students — those who sat both Wave 1 and Wave 2 in the same term — whose performance moved up, stayed the same, or declined.

Within-term improvement: Wave 1 → Wave 2, three terms
Weighted by number of questions attempted.

Across all three terms, 57–69% of students improve between Wave 1 and Wave 2 within the same term. For students who come back the next term — the determined repeaters — 80.95% improve. Students who start blocked by syntax are progressing: more than half of those who were syntax-gated in Term 1 moved to pass-like profiles in Term 2.

The structural improvement is visible in the code itself, not just the scores. Between terms, students use significantly more for loops and if statements, and fewer print() calls as a substitute for proper output. The code is getting more competent.
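The improved/same/declined split is a simple pairing computation over students who sat both waves. A toy sketch with invented scores:

```python
# Toy sketch of the Wave 1 → Wave 2 pairing: for each paired student,
# compare wave scores and report the fractions who improved, held
# steady, or declined. The score pairs below are invented.
def improvement_split(pairs):
    """pairs: list of (wave1_score, wave2_score) per paired student."""
    up = sum(1 for w1, w2 in pairs if w2 > w1)
    same = sum(1 for w1, w2 in pairs if w2 == w1)
    down = len(pairs) - up - same
    n = len(pairs)
    return up / n, same / n, down / n

wave_pairs = [(2, 4), (3, 3), (1, 5), (4, 2), (0, 3)]  # (wave1, wave2)
print(improvement_split(wave_pairs))  # → (0.6, 0.2, 0.2)
```

The report additionally weights by number of questions attempted, which this sketch omits for clarity.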

"The failures are comprehensible. They have names, they have causes — and causes, unlike chaos, can be fixed."

Now the bad news.

There are 497 students who appear in all three terms — the most persistent cohort. The expectation might be that these are students stuck in S2, running code that runs but fails. That's wrong.

The 497 persistent students — what their state trajectories actually show
Top dominant-state trajectories across all three terms. S1 = non-parseable syntax. S0 = no code at all.

Of the 497 students seen across all three terms, 181 follow the same trajectory: non-parseable syntax in Term 1, non-parseable syntax in Term 2, non-parseable syntax in Term 3. Not S2. Not logic failure. Fundamental syntax. They can't yet write code that runs.

Only 3 students in this persistent cohort start from S2. The hardest-to-help group needs foundational Python instruction — typing, basic syntax, what a function looks like, how indentation works. No amount of debugging practice helps someone who can't write a parseable function.

Meanwhile, the dominant-S2 students — the ones in the death spiral — are a different group. The bad news for them: the escape rate from dominant-S2 to dominant-S3/S4 is exactly 0% within-term and cross-term in this dataset. None of the students who were predominantly S2 across a term or wave moved to a predominantly S3/S4 state under the dominant-state lens. They need a different kind of support — structured, conceptual help — not just more run attempts.


Act V

The Exam Is Measuring the Wrong Things — Partially

The evaluation system has real strengths. Public-test overfitting — the phenomenon where students game the sample tests — is almost nonexistent: just 0.02% of submitters had public scores much higher than private scores. Formatting is not a meaningful barrier (0.25% of attempts). These are good signs.

But there are three design problems that limit both fairness and insight.

Problem 1: Test Cases Are Highly Redundant
34.46% of within-question test-case pairs are near-redundant — they test the same thing twice. The median Cronbach's alpha is 0.9716, meaning tests are consistent but overlapping. This wastes measurement capacity that could be used to probe harder edge cases.

Problem 2: The Exam Can't See Its Weakest Students
33 out of 35 question sets are flagged as "low-ability blind." The median ratio of information at low versus mid ability is 0.1555 — the exam discriminates well among average students, but provides almost no signal about weaker students. Later-term cohorts contain disproportionately more weaker students. The exam is least useful precisely where it's most needed.

Problem 3: Some Variants Aren't Equivalent
Students receive different question variants, which are supposed to be interchangeable. They are not always. The largest linked theta gap between variants is 0.653 — equivalent to a meaningful shift in apparent ability. A student assigned a harder variant appears worse than they are.

What the Evaluation Does Well
No meaningful public-private gaming (0.02%). No formatting tax (0.25%). The pass-through risk model predicts future outcomes with AUC 0.9193 — good enough to use for targeted support planning before each term.
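Near-redundancy of test-case pairs can be screened for by comparing their pass/fail patterns across students. A minimal sketch — the 0.95 agreement threshold is illustrative, not the criterion behind the report's 34.46% figure:

```python
from itertools import combinations

# Illustrative sketch: flag test-case pairs whose pass/fail outcomes
# agree for almost every student. Such pairs add little independent
# measurement. The outcomes and threshold below are invented.
def redundant_pairs(results, threshold=0.95):
    """results: dict test_id -> list of 0/1 outcomes, one per student."""
    flagged = []
    for a, b in combinations(sorted(results), 2):
        agree = sum(x == y for x, y in zip(results[a], results[b]))
        if agree / len(results[a]) >= threshold:
            flagged.append((a, b))
    return flagged

outcomes = {
    "t1": [1, 1, 0, 1, 0, 1],
    "t2": [1, 1, 0, 1, 0, 1],   # identical pattern to t1: redundant
    "t3": [0, 1, 1, 0, 1, 0],   # mostly disagrees: independent signal
}
print(redundant_pairs(outcomes))  # → [('t1', 't2')]
```

A flagged pair is a candidate for replacement by a test that probes an edge case the question does not yet cover.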

Act VI — Implications

What Needs to Change

The findings in this data are not uniformly discouraging. Most students improve. Most archetypes can be redirected with the right intervention. But specific, concrete actions are needed across three domains: teaching, exam design, and platform infrastructure.

The system that produced these results is already better than most. It runs a national programming assessment at scale, gives students multiple chances, and tracks outcomes over time. These recommendations are not a condemnation — they're the next level. The data exists. The patterns are clear. The cost of not acting is that the 181 students in the permanent S1 → S1 → S1 spiral come back for a fourth term without anything having changed for them.

They deserve better than that.

Data sources: This analysis covers 2,057,658 event logs from 13,623 unique students across three OPPE terms (25t1, 25t2, 25t3), each with two waves separated by approximately 35 days. The 151,778 student-question attempt rows are the primary unit of analysis. Waterfall percentages are calculated over all student-question attempts. Archetype analysis covers all attempts with resolved primary labels (13 categories); "Other" residual is 6.40%. Concept mastery uses public-best all-pass rate as a proxy, consistent with the psychometric model.

Caveats: 23 of 35 question sets (Track B) have zero captured private submissions — public-best scores are used as the psychometric input for these namespaces, which slightly overstates absolute mastery (14.27% of submitters have public category > private category). The 497 all-three-term cohort is a persistent subset, not the full repeater population. S2 escape rate of 0% is based on dominant-state classification (strict lens); transitions do occur at the run level. Prerequisite graph and concept tags are heuristically generated and should be cross-checked against the actual curriculum before operational use.

Steps 4–9: Waterfall (Step 4), Process archetypes (Step 5/5a), IRT psychometrics (Step 6), Evaluation redesign synthesis (Step 7), Longitudinal analysis (Step 8), Concept knowledge modelling (Step 9).