A · Inside the Exam
The Python OPPE is a 90-minute online programming exam taken by students in IIT Madras's
BS Data Science programme. Every save is timestamped. Every test run is logged. Every
submission has a score.
Students solve problems in a browser-based IDE. They can run against
public test cases whenever they want. The
private tests only run on final submission. They don't know
what the private tests contain — they only see pass/fail counts.
A · Inside the Exam
45.5% of submitters didn't get full marks. Those submissions have stories. Today we're going to read four of them.
A · Inside the Exam
For a subset of students, we recorded full session replays: every keystroke, every save,
every test run, with timestamps. This gives us something unusual — not just what
students wrote, but how they got there.
Each event is a snapshot: timestamp, event type (save / run public / run private / submit),
the full source code at that moment, and the test results.
We can watch a student's mental model evolve in real time. Or fail to evolve.
A · Inside the Exam
10D.For each question: one student who got it immediately, one who struggled productively, and one who got stuck in an interesting way.
B · Q1 · Card to Value Tuple
B · Q1 · Baseline
def card_to_value_tuple(card: str) -> tuple:
suit_map = {'S': 1, 'H': 2, 'D': 3, 'C': 4}
rank_map = {'A': 1, 'J': 11, 'Q': 12, 'K': 13,
'2': 2, '3': 3, '4': 4, '5': 5,
'6': 6, '7': 7, '8': 8, '9': 9, '10': 10}
rank = card[:-1]
suit = card[-1]
return (suit_map[suit], rank_map[rank])
card[:-1] for rank, card[-1] for suit.
The student looked at the function signature, understood that suits are always one character,
and wrote the general solution immediately. First run passes everything.
This is what reading the spec before reading the examples looks like in practice.
B · Q1 · Productive Struggle
def card_to_value_tuple(card: str) -> tuple:
color = card[-1]
if color == 'S': b = 1
elif color == 'H': b = 2
elif color == 'D': b = 3
elif color == 'C': b = 4
a = rank_map[card[:-1]]
return tuple(a, b)
This raises TypeError: tuple expected at most 1 argument, got 2.
Did you know this? I didn't!
B · Q1 · Productive Struggle
def card_to_value_tuple(card: str) -> tuple:
color = card[-1]
if color == 'S': b = 1
elif color == 'H': b = 2
elif color == 'D': b = 3
elif color == 'C': b = 4
a = rank_map[card[:-1]]
return (a, b)
Now runnable. Still wrong. (a, b) returns (rank, suit). The spec says (suit, rank).
The parentheses are fixed. The order is backwards.
B · Q1 · Productive Struggle
def card_to_value_tuple(card: str) -> tuple:
color = card[-1]
if color == 'S': b = 1
elif color == 'H': b = 2
elif color == 'D': b = 3
elif color == 'C': b = 4
a = rank_map[card[:-1]]
return (b, a)
Swap (a, b) to (b, a). Public: 4/4. Private: 4/4.
The gap between first attempt and full marks is forty seconds — mostly spent fighting the tuple
constructor and then left-right order. Two mistakes, both recoverable, both obvious in retrospect.
B · Q1 · The Hidden Boss
def card_to_value_tuple(card: str) -> tuple:
value_1 = card[1] # suit?
value_2 = card[0] # rank?
try:
value_1 = suit_values[value_1]
except:
value_1 = int(value_1)
try:
value_2 = rank_values[value_2]
except:
value_2 = int(value_2)
return (value_1, value_2)
card[1] for the suit, card[0] for the rank. Works for 'AH', '7D', 'QS', '9C'.
But '10H' destroys the parser.
B · Q2 · Check for Greeting Prefix
B · Q2 · Baseline
def starts_with_greeting(s: str) -> bool:
return s.startswith('Hello ') or s.startswith('Hi ')
The docstring says "starts with 'Hello ' or 'Hi '." The student wrote exactly that sentence in Python. Nothing more. The trailing space is there because the docstring says it's there. One second, because this is exactly what reading the spec looks like.
B · Q2 · The Trailing Space
def starts_with_greeting(s: str) -> bool:
return s.startswith('Hello') or s.startswith('Hi')
startswith('Hi') matches 'Hi friend' and also 'Hithere'. The trailing space
disambiguates them. The fix is obvious.
B · Q2 · The Trailing Space
def starts_with_greeting(s: str) -> bool:
return s.startswith('Hello ') or s.startswith('Hi ')
Two spaces added — one to each argument. Public: 4/4. Private: 100. The entire gap between 67 and 100 is two characters. This is the cleanest illustration of how test cases can reveal hidden issues.
B · Q2 · JavaScript Accent
def starts_with_greeting(s: str) -> bool:
if s.startswith('Hello' || 'Hi'):
return True
return False
|| is logical OR in JavaScript. In Python it's a syntax error.
The intent here is completely correct — the student wants OR between two strings — but the
operator is from a different language. Surface intent: perfect. Language fluency: still catching up.
B · Q2 · JavaScript Accent
def starts_with_greeting(s: str) -> bool:
if s.startswith('Hello' or 'Hi'):
return True
return False
'Hello' or 'Hi' evaluates to 'Hello', because 'Hello'
is truthy. So startswith('Hello' or 'Hi') is just startswith('Hello').
'Hi friend' now returns False. The fix — two separate startswith calls — comes a few events later.
B · Q2 · Diplomatic Patch
def starts_with_greeting(s: str) -> bool:
if s == 'Hithere':
return False
if s.startswith('Hello' or 'Hi' or 'hello' or 'hi'):
return True
elif s.startswith('Hi'):
return True
return False
The student saw that 'Hithere' should return False and their code was failing on it. So they added an explicit exception for that exact string. The public test is satisfied. The rule — trailing space — remains unimplemented. This is not debugging. It's treaty negotiations with the grader. One visible case, one explicit patch. You have seen this in a pull request.
B · Q2 · Version Control by Comment
def starts_with_greeting(s: str) -> bool:
if not isinstance(s, str):
return False
s = s.strip()
return s.startswith('Hello') or s.startswith('Hi')
'''cleaned = s.strip().lower()
is_hello = cleaned.startswith("hello ") and (
len(cleaned) == 6 or not cleaned[6].isalpha())
is_hi = cleaned.startswith("hi ") and (
len(cleaned) == 3 or not cleaned[3].isalpha())
return is_hello or is_hi'''
Student began the right solution, commented it out, and ignored it. Commenting code may be OK. Ignoring it may not.
B · Q3 · Shuffle a Three-Word Sentence
order parameter is a tuple of indices. The question is whether the student
treats it as data (general) or treats it as a pattern to match (specific).
B · Q3 · Baseline
order as Data. One Line.def shuffle_sentence(sentence: str, order: tuple) -> str:
s = sentence.split(" ")
return " ".join([s[i] for i in order])
Split, index, join. The student saw order as a general index plan
and used it directly. The abstraction is visible in three words: s[i] for i in order.
This works for any sentence, any order, any length — because the student solved the rule, not the examples.
B · Q3 · Overfitting
def shuffle_sentence(sentence: str, order: tuple) -> str:
if order == (0, 2, 1):
return 'apple orange banana'
The public tests show specific inputs with specific outputs. This student learned the first one.
B · Q3 · Overfitting
def shuffle_sentence(sentence: str, order: tuple) -> str:
if order == (0, 2, 1):
return 'apple orange banana'
elif order == (2, 1, 0):
return 'mouse dog cat'
Still learning examples. Not the rule.
B · Q3 · Overfitting
def shuffle_sentence(sentence: str, order: tuple) -> str:
if order == (0, 2, 1):
return 'apple orange banana'
elif order == (2, 1, 0):
return 'mouse dog cat'
elif order == (1, 0, 2):
return 'yellow red green'
Three public greens, zero private passes. The function is convinced that all sentence shuffling is secretly about one fruit salad. The private tests use different words.
B · Q3 · Brute Force
def shuffle_sentence(sentence: str, order: tuple) -> str:
s = sentence.split(" ")
n0, n1, n2 = s[0], s[1], s[2]
if order == (0, 1, 2): return n0 + " " + n1 + " " + n2
elif order == (0, 2, 1): return n0 + " " + n2 + " " + n1
elif order == (1, 0, 2): return n1 + " " + n0 + " " + n2
elif order == (1, 2, 0): return n1 + " " + n2 + " " + n0
elif order == (2, 0, 1): return n2 + " " + n0 + " " + n1
else: return n2 + " " + n1 + " " + n0
Six branches, one for each permutation of three words. It is correct — there are exactly six three-word orderings, and all six are here. Private: 100. Brute force wins a small, perfectly legal victory over abstraction. The student isn't wrong. They've just decided that abstraction is optional when the universe is small enough to enumerate. This is a strategy that won't survive scaling. But it worked here.
B · Q4 · Pangram Check
B · Q4 · Baseline
def is_pangram(text: str) -> bool:
alphabet = set("abcdefghijklmnopqrstuvwxyz")
return alphabet.issubset(set(text.lower()))
Build the alphabet as a set. Check if it's a subset of the unique characters in the input. The invariant — every letter must appear — maps directly to set membership. The entire logic is one expression. This student saw the mathematical structure of the problem before reaching for a loop.
B · Q4 · Find the Bug
def is_pangram(text: str) -> bool:
letters = "absdefghijklmnopqrstuvwxyz"
count = 0
for char in text.lower():
if char in letters:
count += 1
return count >= 26
Take a moment.
B · Q4 · The Checking Mechanism Needs Checking
def is_pangram(text: str) -> bool:
letters = "absdefghijklmnopqrstuvwxyz"
count = 0
for char in text.lower():
if char in letters:
count += 1
return count >= 26
"absdefghijklmnopqrstuvwxyz" — 'c' is missing, replaced by a second 's'.
The function whose entire job is to verify all 26 letters starts by not including one.
Under cognitive load, even the checking mechanism needs to be checked.
B · Q4 · The Heuristic
def is_pangram(text: str) -> bool:
alphabets = string.ascii_lowercase
text1 = text.lower().replace(' ', '')
count = 0
for i in range(len(text1)):
if text1[i] in alphabets:
count += 1
return count >= 26
Count how many letters appear in the string. If 26 or more, return True. The public tests pass because "the quick brown fox..." has well over 26 letters. The private tests do not.
B · Q4 · The Heuristic → The Fix
def is_pangram(text: str) -> bool:
alphabets = string.ascii_lowercase
text1 = text.lower().replace(' ', '')
uniq = set()
for i in range(len(text1)):
if text1[i] in alphabets:
uniq.add(text1[i])
return len(uniq) >= 26
Switch from counting total letters to counting unique letters. Private: 100.
The false summit is the moment count >= 26 passed three public tests.
The actual solution came later, when the question changed from "does my code pass?" to
"what family of inputs would break it?"
B · Q4 · The Hard Case
At event 1, Rita had the right approach: iterate through characters, check membership, return False on failure. One bug — return True was inside the loop. She didn't fix the bug. She replaced the approach entirely with count >= 26, which passed all public tests and nothing else.
It took 103 events and 1 hour, 51 minutes to find her way back to essentially the same structure — minus the indentation error.
Q4 · Pangram Check
def is_pangram(text: str) -> bool:
alphabets=string.ascii_lowercase
text1=text.lower()
for i in text1:
if i not in alphabets:
return False
return True
The right structure is already here: iterate through characters, check membership, return False on failure. One bug — return True is inside the for loop. Fix the indentation and this is a correct solution. She didn't fix it. She replaced the whole approach.
Q4 · Pangram Check
def is_pangram(text: str) -> bool:
alphabets=string.ascii_lowercase
text1=text.lower().replace(' ','')
'''for i in range(len(text1)):
if text[i] not in alphabets:
return False
return True'''
count=0
for i in range(len(text1)):
if text1[i] in alphabets:
count+=1
if count>=26:
return True
return False
The original loop code is in triple quotes — buried alive. The new approach counts every letter. If count ≥ 26, it's a pangram. The problem: "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" also has 26+ letters. The hidden tests know this. The public tests don't care.
Q4 · Pangram Check
def is_pangram(text: str) -> bool:
alphabets=string.ascii_lowercase
text1=text.lower().replace(' ','')
count=0
for i in range(len(text1)):
if text1[i] in alphabets:
count+=1
if count>=26:
return True
return False
She removed the triple-quoted graveyard. The file is clean. The logic is unchanged. The count heuristic is still wrong. Tidiness doesn't move the grader.
Q4 · Pangram Check
def is_pangram(text: str) -> bool:
alphabets=string.ascii_lowercase
text1=text.lower().replace(' ','')
'''count=0
for i in range(len(text1)):
if text1[i] in alphabets:
count+=1
if count>=26:
return True
return False'''
uniq=set()
for i in range(len(text1)):
if text1[i] in alphabets:
uniq.add(text1[i])
if len(uniq)>=26:
return True
return False
The count code goes into triple quotes. The set-based approach — collect unique letters, check if there are 26 — is live. This is structurally what she had at event 1, fixed. It took 103 events and 1:51 to get here.
B · The Pattern
Spot the pattern from the kinds of mistakes.
Section C
27,577 exam sessions · 4 navigation patterns · one very large performance gap
C · How Students Navigate
Kruskal-Wallis H=1570, p < 10−300. This is not noise.
C · Navigation Patterns
100% first-sweep coverage — they touched every question before going back to any of them. Revisit rate: 22.5%. The strategy is: survey the paper, find what you can solve, solve it, then come back for the hard ones. This is also what the test-prep industry has been telling you for decades. Most students didn't do it.
C · Navigation Patterns
First-sweep coverage: 56.9% — they start revisiting before they've seen the whole paper. Revisit rate: 50%. The problem isn't effort. They're spending time re-reading problems they haven't solved yet instead of seeing whether other questions are easier. It's studying during the exam, which is a fine idea, except the exam is also happening.
C · Navigation Patterns
Paper coverage: 71%. They started on the hardest question (Q10), couldn't solve it, jumped to easy wins (Q5, Q6, Q9), then kept returning to Q10. The approach of "find easy points first" is sound. The problem is that Q8 went unseen — and was probably solvable.
C · Navigation Patterns
Paper coverage: 50.5%. This student solved Q5 in 90 seconds, then spent the rest of the exam on Q9 and Q12. They couldn't solve either. They also couldn't see that other questions existed until it was too late. 42.8% of their moves were local toggles. The exam was two hours long. They used all of it on two wrong answers.
C · The Gap
Cliff's delta 0.32 vs Togglers. p < 10−296. Caveat: correlation, not causation — stronger students probably also navigate better. Both matter.
C · What This Means
count >= 26 passed all public tests. Students who hardcoded 'apple orange banana' passed all the examples. That gap is where most of the failing happens.Anand S · PyConf Hyderabad · 14 March 2026 · sanand0.github.io/talks/2026-03-14-how-students-learn-python/