OPPE 2025 · Python Programming Assessment · Research Report

Inside the Exam

What happens when 13,000 students sit a programming exam three times — and what the data says we should do differently

13,623 unique students
46.83% full pass rate
26.72% logic failures
80.95% of repeaters improve

Somewhere in India, in April 2025, a student submits her Python exam for the third time. Three terms. Three sittings. Each one roughly 35 days after the last wave. She's not lazy — she's persistent. But something about the exam keeps catching her out, and no one is entirely sure what.

The OPPE — the Online Programming Practice Exam — tests Python coding across a national cohort of students. It runs in waves: a first sitting at the start of the term, a second sitting roughly five weeks later. Students who struggle can come back the next term. The system is designed with compassion. But compassion alone doesn't tell you why someone keeps failing, or what help would actually work.

Two million event logs later, we have answers. Not perfect ones — but actionable ones. The findings rewrote several assumptions about who struggles, how, and why. This is that story.


Act I

The Wrong Diagnosis

The working hypothesis going into this analysis was familiar and plausible: students are blocked by syntax. They can't write valid Python. Fix the syntax problem, fix the pass rate.

The data disagrees.

The chart below shows every student-question attempt, sorted into eight categories depending on where the attempt ended. Read it like a funnel — from the top pass rate down to the students who never wrote a line of code. The big bar in the middle is the one that matters.

Where 100% of student-question attempts end — the gating waterfall
Syntax gates (combined) account for less than 10% of all attempts.

Combined syntax gates — mechanical and fundamental — total just 9.5% of all attempts. Genuine logic failure — code that runs, that passes some tests, that looks correct but isn't — accounts for 26.72%. Logic failure is nearly three times more common than syntax failure.

"The biggest problem is not students failing to write Python syntax. The biggest problem is students writing runnable Python that gives the wrong answer."

And there's a subtler finding embedded in the waterfall: 45.49% of students who ended with non-parseable code had parseable code earlier in the same session. They broke working code while trying to fix it. That's not a knowledge failure. That's a debugging workflow failure — and it calls for a completely different intervention.

The full-pass rate, at 46.83%, is healthier than many assume. This exam is not impossible. But the path to raising it runs through logic and debugging support, not a syntax remediation campaign.
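The waterfall's categories can be sketched as a simple decision cascade. This is an illustrative simplification, not the pipeline's actual eight-category rule, and the field names are invented for the example:

```python
# Hypothetical sketch of the gating waterfall: bucket one student-question
# attempt by where it ended. Field names are illustrative, not the real
# OPPE schema, and the categories are a simplified subset of the report's.
def gate(attempt):
    """Return a waterfall category for one attempt dict."""
    if not attempt["code"].strip():
        return "no_code"                  # never wrote a line
    if not attempt["parses"]:
        return "syntax_gate"              # ended with non-parseable code
    if attempt["runtime_error"]:
        return "runtime_gate"             # parses, but crashes
    if attempt["public_passed"] == 0:
        return "logic_gate_zero"          # runs, passes nothing (S2-like)
    if attempt["public_passed"] < attempt["public_total"]:
        return "logic_gate_partial"       # runs, partially correct
    return "full_pass"

example = {"code": "def f(x): return x * 2", "parses": True,
           "runtime_error": False, "public_passed": 3, "public_total": 5}
print(gate(example))  # → logic_gate_partial
```

The ordering matters: each gate is only reached by attempts that cleared every gate above it, which is what makes the category shares sum to 100%.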


Act II

The Student Who Tries the Hardest Gets the Worst Grade

If raw effort determined outcomes, the students who ran the most tests and spent the most time coding would have the highest pass rates. They don't. Not even close.

The chart below plots every student archetype — defined by how students actually work, not just their final score — by the median time spent versus the success rate achieved. The bubble size shows how common the archetype is.

Effort vs. outcomes by student archetype
Each bubble = one archetype. Size = share of all attempts. X = median active time. Y = success rate.

The Steady builder finishes with an 89% success rate, spending a median of 643 seconds. The Volatile reworker — who keeps restructuring code, running test after test, spending more than five times as long — ends with 36% success. The Thrasher, who runs a median of 36 public tests and spends over 75 minutes, achieves only 44%.

This is not about intelligence. Both archetypes are attempting the same problems. The difference is process: the Steady builder makes small, focused changes and tests them. The Volatile reworker makes sweeping rewrites and gets lost in a tangle of their own edits.

The Regression Archetype: 7.66% of attempts belong to students who end with code that is worse than what they started with. Their success rate: 5.65%. This is the extreme case of the Volatile reworker — students who had something passable, kept editing, and destroyed it. Nearly half (45.49%) of all sessions that end in syntax errors passed through a parseable state first.

The S2 Death Spiral

The most important single number in this analysis: 78.93%. That's the probability that a student who is in State 2 — code runs, passes zero tests — will still be in State 2 on their next submission.

State 2 is the largest state in the entire dataset: 47.1% of all public test-run states. Once students enter it, they almost never escape through more of the same. They need targeted feedback. They need to know what is wrong, not just that it's wrong. Getting from S2 to S3 (some tests passing) happens only 7.18% of the time in the next run; reaching full pass (S4) happens 3.56% of the time.

State 2 is nearly inescapable — the transition probabilities
S2: code runs but passes 0 public tests (47.1% of all public runs). The self-loop is 78.93%.

Syntax errors, by contrast, clear quickly. Students with structural syntax errors resolve them within a single public run 50.33% of the time. Wrong-answer problems persist to the final submission 39.03% of the time. Syntax errors are painful but soluble. Logic errors are sticky.
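The transition probabilities behind these figures come from counting consecutive state pairs in each student's run sequence. A minimal sketch, using toy sequences rather than the OPPE logs:

```python
from collections import Counter, defaultdict

# Illustrative sketch: estimating the S2 self-loop probability from
# per-student sequences of public-run states (S0..S4). The sequences
# below are toy data, not the actual dataset.
def transition_probs(sequences):
    """Return {state: {next_state: probability}} from run sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):   # consecutive run pairs
            counts[a][b] += 1
    return {s: {t: n / sum(c.values()) for t, n in c.items()}
            for s, c in counts.items()}

runs = [
    ["S2", "S2", "S2", "S3"],   # slow escape to partial pass
    ["S2", "S2", "S2", "S2"],   # the self-loop
    ["S1", "S2", "S2", "S4"],   # syntax fix, then full pass
]
probs = transition_probs(runs)
print(round(probs["S2"]["S2"], 3))  # → 0.75
```

On the real data the same computation yields the 78.93% self-loop the report describes.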


Act III

What Students Don't Understand — and Why It's Not What You'd Expect

The ten concepts tested in this exam are not equally hard. The chart below shows the fraction of students who pass all public tests for each concept — from easiest to hardest. The gap is shocking.

Concept mastery (% passing all public tests) — sorted easiest to hardest
A 39-point gap separates arithmetic (easiest) from data analysis (hardest).

Arithmetic and conditionals: 60% pass rate. Data analysis and aggregation: 21% pass rate. That's a 39-point gap between the easiest concept and the hardest. The hard concepts share a profile: they require students to maintain a coherent mental model across multiple functions, handle non-trivial input formats, and produce precisely formatted output.

But here's the finding that changes the teaching prescription: the data analysis failure is not primarily a selection failure. Students who are stuck in S2 (code runs, passes zero tests) are, in most cases, already using the right constructs. They picked the right tool. They just can't make the tool do the right thing.

Across S2 failures in data analysis questions: 66.8% are application-gap failures — students who chose the right approach but couldn't execute it correctly. For pattern printing, that number reaches 93.6%. This means "teach the concept again" is the wrong intervention. The intervention needed is debugging practice — worked examples, guided code reading, small-step walkthroughs.

The Loop Paradox: Loops are used in 48.7% of all attempts — students know when they need a loop. But among those students, the pass rate is only 45.7%. They know the tool exists. They don't know how to use it without making it exit too early, iterate over the wrong thing, or accumulate the wrong result. High usage, low mastery = practice problem, not exposure problem.
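Detecting construct usage like this is straightforward with Python's own parser. A hedged sketch of the kind of check behind the loop-usage figure — the classification rule here is ours, not necessarily the pipeline's:

```python
import ast

# Illustrative sketch: does a submission use a loop construct? Parsing
# rather than executing means even broken-at-runtime code is counted,
# while non-parseable code is not.
def uses_loop(source):
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False     # non-parseable code can't be credited with a loop
    return any(isinstance(node, (ast.For, ast.While))
               for node in ast.walk(tree))

print(uses_loop("total = 0\nfor x in data:\n    total += x"))  # → True
print(uses_loop("total = sum(data)"))                          # → False
```

The same `ast.walk` pattern extends to counting `if` statements or `print()` calls, the constructs tracked in the cross-term comparison.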

Act IV

The Good News, the Bad News, and the 181

Here is the genuinely hopeful finding: most students improve. The chart below shows the fraction of paired students — those who sat both Wave 1 and Wave 2 in the same term — whose performance moved up, stayed the same, or declined.

Within-term improvement: Wave 1 → Wave 2, three terms
Weighted by number of questions attempted.

Across all three terms, 57–69% of students improve between Wave 1 and Wave 2 within the same term. For students who come back the next term — the determined repeaters — 80.95% improve. Students who start blocked by syntax are progressing: more than half of those who were syntax-gated in Term 1 moved to pass-like profiles in Term 2.

The structural improvement is visible in the code itself, not just the scores. Between terms, students use significantly more for loops and if statements, and fewer print() calls as a substitute for proper output. The code is getting more competent.
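The improved/same/declined split is a simple pairing computation over students who sat both waves. A toy sketch with invented scores:

```python
# Toy sketch of the Wave 1 → Wave 2 pairing: for each paired student,
# compare wave scores and report the fractions who improved, held
# steady, or declined. The score pairs below are invented.
def improvement_split(pairs):
    """pairs: list of (wave1_score, wave2_score) per paired student."""
    up = sum(1 for w1, w2 in pairs if w2 > w1)
    same = sum(1 for w1, w2 in pairs if w2 == w1)
    down = len(pairs) - up - same
    n = len(pairs)
    return up / n, same / n, down / n

wave_pairs = [(2, 4), (3, 3), (1, 5), (4, 2), (0, 3)]  # (wave1, wave2)
print(improvement_split(wave_pairs))  # → (0.6, 0.2, 0.2)
```

The report additionally weights by number of questions attempted, which this sketch omits for clarity.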

"The failures are comprehensible. They have names, they have causes — and causes, unlike chaos, can be fixed."

Now the bad news.

There are 497 students who appear in all three terms — the most persistent cohort. The expectation might be that these are students stuck in S2, running code that runs but fails. That's wrong.

The 497 persistent students — what their state trajectories actually show
Top dominant-state trajectories across all three terms. S1 = non-parseable syntax. S0 = no code at all.

Of the 497 students seen across all three terms, 181 follow the same trajectory: non-parseable syntax in Term 1, non-parseable syntax in Term 2, non-parseable syntax in Term 3. Not S2. Not logic failure. Fundamental syntax. They can't yet write code that runs.

Only 3 students in this persistent cohort start from S2. The hardest-to-help group needs foundational Python instruction — typing, basic syntax, what a function looks like, how indentation works. No amount of debugging practice helps someone who can't write a parseable function.

Meanwhile, the dominant-S2 students — the ones in the death spiral — are a different group. The bad news for them: the escape rate from dominant-S2 to dominant-S3/S4 is exactly 0% within-term and cross-term in this dataset. None of the students who were predominantly S2 across a term or wave moved to a predominantly S3/S4 state under the dominant-state lens. They need a different kind of support — structured, conceptual help — not just more run attempts.


Act V

The Exam Is Measuring the Wrong Things — Partially

The evaluation system has real strengths. Public-test overfitting — the phenomenon where students game the sample tests — is almost nonexistent: just 0.02% of submitters had public scores much higher than private scores. Formatting is not a meaningful barrier (0.25% of attempts). These are good signs.

But there are three design problems that limit both fairness and insight.

Problem 1: Test Cases Are Highly Redundant
34.46% of within-question test-case pairs are near-redundant — they test the same thing twice. The median Cronbach's alpha is 0.9716, meaning tests are consistent but overlapping. This wastes measurement capacity that could be used to probe harder edge cases.

Problem 2: The Exam Can't See Its Weakest Students
33 out of 35 question sets are flagged as "low-ability blind." The median ratio of information at low versus mid ability is 0.1555 — the exam discriminates well among average students, but provides almost no signal about weaker students. Later-term cohorts contain disproportionately more weaker students. The exam is least useful precisely where it's most needed.

Problem 3: Some Variants Aren't Equivalent
Students receive different question variants, which are supposed to be interchangeable. They are not always. The largest linked theta gap between variants is 0.653 — equivalent to a meaningful shift in apparent ability. A student assigned a harder variant appears worse than they are.

What the Evaluation Does Well
No meaningful public-private gaming (0.02%). No formatting tax (0.25%). The pass-through risk model predicts future outcomes with AUC 0.9193 — good enough to use for targeted support planning before each term.
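Near-redundancy of test-case pairs can be screened for by comparing their pass/fail patterns across students. A minimal sketch — the 0.95 agreement threshold is illustrative, not the criterion behind the report's 34.46% figure:

```python
from itertools import combinations

# Illustrative sketch: flag test-case pairs whose pass/fail outcomes
# agree for almost every student. Such pairs add little independent
# measurement. The outcomes and threshold below are invented.
def redundant_pairs(results, threshold=0.95):
    """results: dict test_id -> list of 0/1 outcomes, one per student."""
    flagged = []
    for a, b in combinations(sorted(results), 2):
        agree = sum(x == y for x, y in zip(results[a], results[b]))
        if agree / len(results[a]) >= threshold:
            flagged.append((a, b))
    return flagged

outcomes = {
    "t1": [1, 1, 0, 1, 0, 1],
    "t2": [1, 1, 0, 1, 0, 1],   # identical pattern to t1: redundant
    "t3": [0, 1, 1, 0, 1, 0],   # mostly disagrees: independent signal
}
print(redundant_pairs(outcomes))  # → [('t1', 't2')]
```

A flagged pair is a candidate for replacement by a test that probes an edge case the question does not yet cover.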

Act VI — Implications

What Needs to Change

The findings in this data are not uniformly discouraging. Most students improve. Most archetypes can be redirected with the right intervention. But specific, concrete actions are needed across three domains: teaching, exam design, and platform infrastructure.

The system that produced these results is already better than most. It runs a national programming assessment at scale, gives students multiple chances, and tracks outcomes over time. These recommendations are not a condemnation — they're the next level. The data exists. The patterns are clear. The cost of not acting is that the 181 students in the permanent S1 → S1 → S1 spiral come back for a fourth term without anything having changed for them.

They deserve better than that.

Data sources: This analysis covers 2,057,658 event logs from 13,623 unique students across three OPPE terms (25t1, 25t2, 25t3), each with two waves separated by approximately 35 days. The 151,778 student-question attempt rows are the primary unit of analysis. Waterfall percentages are calculated over all student-question attempts. Archetype analysis covers all attempts with resolved primary labels (13 categories); "Other" residual is 6.40%. Concept mastery uses public-best all-pass rate as a proxy, consistent with the psychometric model.

Caveats: 23 of 35 question sets (Track B) have zero captured private submissions — public-best scores are used as the psychometric input for these namespaces, which slightly overstates absolute mastery (14.27% of submitters have public category > private category). The 497 all-three-term cohort is a persistent subset, not the full repeater population. S2 escape rate of 0% is based on dominant-state classification (strict lens); transitions do occur at the run level. Prerequisite graph and concept tags are heuristically generated and should be cross-checked against the actual curriculum before operational use.

Steps 4–9: Waterfall (Step 4), Process archetypes (Step 5/5a), IRT psychometrics (Step 6), Evaluation redesign synthesis (Step 7), Longitudinal analysis (Step 8), Concept knowledge modelling (Step 9).