In the workshop, Anand opened Codex in front of a hundred students, typed three words ("Solve this exam") and let it run in the background while he talked. It scored 9/10. A student named Jaideep, working from a slightly different angle, got it to 10/10. That much you already know.
This is the part the room never got to see: what actually happened inside those two runs. Both were recorded — every command, every tool call, every score the page handed back. Reading the logs is like watching two chess engines play the same opening and diverge. And the most surprising thing is not that Codex won. It's how it won.
"Codex did not solve the exam like a student reading questions and reasoning aloud. It solved it like a forward-deployed engineer. It found the question source, read the validators, generated local solvers, used browser automation to fill answers, used the live 'Check' buttons as feedback, and recovered from errors until the page accepted the answers."
— the post-hoc analysis of both runs
Hold onto that distinction, because it is the whole story. A chat model, handed an exam, tries to know the answers. A coding agent tries to operate the environment that grades them. The exam — ten questions, ten marks, each one explicitly designed to test an AI-era skill — turned out to be full of affordances an engineer could grab: source code, validators, error messages, downloadable files, external APIs, and a "Check" button that told you the truth on every click.
Two Prompts, Two Accounts
The two runs started from almost the same instruction — and the small difference in wording previews the whole personality of each agent.
Anand, on ChatGPT Plus, gave Codex a single terse instruction: solve the exam at this URL. Jaideep, on ChatGPT Pro, was a touch more procedural:
Jaideep's prompt to Codex:
using @chrome visit `https://exam.sanand.workers.dev/2026-06-test` go through every question one by one then solve them till you get the correct answer then submit it
"Go through every question one by one… solve them till you get the correct answer." That phrase — till you get the correct answer — is the entire trick of a verifiable environment, handed to the agent as an instruction. It tells Codex to loop: try, check, repair, repeat. Neither prompt told it how to solve a single question. Both agents had to discover that for themselves. Both ran gpt-5.5 at the default "medium" reasoning effort.
It Read the Source Before It Read the Questions
The first thing Codex did was the thing a student would never think to do: it ignored the questions and went looking for the machinery behind them. The exam at exam.sanand.workers.dev isn't a static page — it's a client-side app that generates a seeded variant of each question per user, checks answers in the browser, and hides the grading logic in bundled JavaScript. To a chat model that's a wall. To a coding agent it's a filing cabinet.
I found no useful prior memory, so I'm proceeding from live state.
What it settled into was a loop you could write on an index card — and which the post-hoc analysis distilled into six steps:
The winning behaviour, closer to software engineering than to prompting:
1. Read the source: find the per-question validator.
2. Extract the validator contract: what EXACT input does it accept?
3. Write a small solver: node/python to compute the answer.
4. Use feedback: click "Check", read the result.
5. Repair precisely: fix to the validator, not to taste.
6. Save only after verification: never submit an unconfirmed answer.
This is the deep reason Codex was strong and a copy-paste chatbot was weak. The exam was stuffed with feedback — source code, validators, error strings, downloadable archives, an external game API, and a Check button that returned ground truth on every press. A chat model has to be right. An agent in a verifiable environment only has to become right. As Anand put it in the room: "Looping with feedback is possibly the most powerful technique that AI can use."
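The loop described above can be sketched in a few lines of Python. This is only an illustration of "try, check, repair, repeat", not Codex's actual control flow; `solve_with_feedback`, `toy_check` and `toy_propose` are hypothetical names, and the toy validator stands in for the page's Check button.

```python
from typing import Callable, Optional, Tuple


def solve_with_feedback(
    propose: Callable[[Optional[str]], str],
    check: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 5,
) -> Optional[str]:
    """Try, check, repair, repeat; return the first verified answer."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        answer = propose(feedback)       # new attempt, informed by the last error
        ok, feedback = check(answer)     # stand-in for clicking "Check"
        if ok:
            return answer                # save only after verification
    return None                          # never hand back an unconfirmed answer


# Hypothetical validator: accepts only the exact string "linear".
def toy_check(answer: str) -> Tuple[bool, str]:
    if answer == "linear":
        return True, "ok"
    return False, "expected a literal axis type such as 'linear'"


# Hypothetical solver: the first guess is wrong, the second reads the error text.
def toy_propose(feedback: Optional[str]) -> str:
    if feedback is None:
        return "logarithmic"
    return "linear" if "'linear'" in feedback else "unknown"


result = solve_with_feedback(toy_propose, toy_check)
print(result)
```

The point of the design is that `propose` never needs to be right the first time; it only needs the validator's message to steer the next attempt.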
What It Cost: Roughly the Price of a Coffee
Before the move-by-move, the question everyone actually asks: how much did this cost? Both runs were on ChatGPT subscriptions, so neither put a number on the screen — but the logs record every token. Priced at gpt-5.5's standard API list rates, here is the bill for letting an agent sit a ten-question exam:
The shape of that bill is the surprising part. An agentic loop re-sends its whole working context every turn, so the runs burned 10–14 million input tokens — but ~96% of them were cached, billed at a tenth of the rate. Actual new output — the reasoning and code Codex wrote — was a mere ~42,000 tokens. Do the arithmetic and the whole exam costs about the price of a coffee: $2.66 for nine marks, $2.21 for ten. On a $20/month plan it barely dents the monthly allowance. (Anand stopped his run at 9/10 not because it was expensive in dollars, but because it was eating into his plan's included credits during a live demo — a quota worry, not a cost worry.)
Ten marks on a deliberately AI-resistant exam, for the price of a flat white. The expensive part of using a coding agent is not the tokens. It's the judgment to know when it's wrong.
— the economics of the two runs, in one line
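The arithmetic behind that bill can be sketched as follows. The token counts and the one-tenth cache discount come from the text above; the per-million rates are placeholders chosen for illustration and are NOT gpt-5.5's list prices, so the dollar figures this prints will not match the $2.66 and $2.21 quoted above. `run_cost_usd` is a hypothetical helper.

```python
def run_cost_usd(input_tokens, cached_fraction, output_tokens,
                 input_rate_per_m, output_rate_per_m, cache_discount=0.1):
    """Estimate the API cost of one agent run in US dollars."""
    cached = input_tokens * cached_fraction   # re-sent context served from cache
    fresh = input_tokens - cached             # genuinely new input
    return (
        fresh * input_rate_per_m
        + cached * input_rate_per_m * cache_discount
        + output_tokens * output_rate_per_m
    ) / 1_000_000


# PLACEHOLDER rates in USD per million tokens (assumptions, not real prices).
PLACEHOLDER_INPUT_RATE = 1.00
PLACEHOLDER_OUTPUT_RATE = 10.00

with_cache = run_cost_usd(10_000_000, 0.96, 42_000,
                          PLACEHOLDER_INPUT_RATE, PLACEHOLDER_OUTPUT_RATE)
without_cache = run_cost_usd(10_000_000, 0.0, 42_000,
                             PLACEHOLDER_INPUT_RATE, PLACEHOLDER_OUTPUT_RATE)
print(f"with cache: ${with_cache:.2f}, without cache: ${without_cache:.2f}")
```

Whatever the real rates are, the structure is the same: the cached share dominates the token count but not the cost, which is why millions of input tokens end up costing so little.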
Ten Questions, Ten Marks: The Move-by-Move
The exam's source manifest labels each question with the capability it tests — "Verifying AI-generated work," "Detecting hallucinations at scale," "Problem breakdown and efficient experimentation," and so on. Here's how Codex handled each, including where it tripped and how it got back up. Watch for a pattern: almost every failure was a wording or formula mismatch, and almost every recovery came from reading the validator's own source.
A randomly chosen buggy JavaScript function — "most frequent element," "validate email," "flatten array," "compound interest." You must return JSON with the bugs, the fixed code, and a test strategy; the validator then runs your function against generated cases.
Jaideep's variant was "most frequently occurring element," where the bad code returned the largest value instead of the most common one:
The planted bug (returns the max, not the mode):

    function mostFrequent(arr) {
      let counts = {}, max = 0;
      for (const x of arr) { counts[x] = (counts[x] || 0) + 1; }
      for (const x of arr) { if (x > max) max = x; } // ← bug: tracks value, not frequency
      return max;
    }

Codex's fix (track the most frequent key):

    function mostFrequent(arr) {
      const counts = {};
      let best = arr[0], bestCount = 0;
      for (const x of arr) {
        counts[x] = (counts[x] || 0) + 1;
        if (counts[x] > bestCount) { bestCount = counts[x]; best = x; }
      }
      return best;
    }
It identified the real defect, wrote the corrected function, and named edge cases (empty array, ties) in the test strategy.
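To make the test strategy concrete, here is a rough Python port of the same fix with the two edge cases named above. The port and its behaviour on empty input (returning `None`) and on ties (the first value to reach the top count wins) are assumptions for illustration; the exam's validator ran JavaScript, not this code.

```python
def most_frequent(items):
    """Return the most common element; None for an empty list."""
    counts = {}
    best, best_count = None, 0
    for x in items:
        counts[x] = counts.get(x, 0) + 1
        if counts[x] > best_count:       # strictly greater: earlier leader keeps ties
            best, best_count = x, counts[x]
    return best


print(most_frequent([9, 1, 1]))      # the buggy version would have returned the max
print(most_frequent([]))             # empty input
print(most_frequent([1, 2, 2, 1]))   # tie between 1 and 2
```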
✓ Passed first try · "practically free marks"
Write exactly N yes/no checks that separate GOOD, MEDIOCRE, and POOR outputs. The validator then spends real money calling an AIPipe LLM judge many times over. This is the one question gated by a credential, and it's where the two runs split.
Jaideep's variant judged SQL query quality; he wrote checks for CTEs, null handling, joins, filtering, grouping. But the page had no token, so the judge couldn't run. Codex tried to inject one programmatically:
Recovery: two failed injections, then the native prompt.

    globalThis.aiPipeToken = "";   // ✗ page global is locked
    window.aiPipeToken = " ";      // ✗ still rejected
    // → Codex falls back to the page's own prompt() dialog and pastes it there ✓
Once the token went in through the front door, Q2 validated and Jaideep hit 10/10. Anand's variant judged API documentation quality; Codex wrote six clean output-only checks — method/path, content type, authentication, a concrete request example, success behaviour, specific error statuses. The rubric content was accepted. But the judge rejected the bearer token outright:
A prompt is missing exactly three of six load-bearing components — audience, output format, data grounding, tone/style, length, action. The validator hunts for both the missing component labels and concrete detection phrases. Generic "make it a good prompt" repairs failed.
Codex stopped guessing and read the component matcher, then inserted the exact constraints it was scanning for — the seeded target metric forecast_error_pct, an "NYT graphics team" style, and a literal numeric length:
The kind of phrase the validator actually greps for:

    Audience: data-literate executives, not analysts
    Grounding: lead with the primary metric forecast_error_pct
    Style: in the style of an NYT graphics-team explainer
    Length: between 110 and 160 words
The lesson the analysis draws: context-engineering questions aren't about writing well. They're about preserving constraints exactly.
↻ Recovered by reading the validator's required phrases
A table plus a paragraph with planted false figures. Fix only the wrong ones; leave the true ones, and a decoy, untouched. Codex regenerated the seeded data, computed the correct values, and found the planted errors (April units, February revenue-per-unit, a reversed May→June return-rate direction, July's online order share).
Then a beautifully brittle failure. The validator fingerprints answers by digit-substring. Codex's correct revenue figure $105,128 contains the substring 512 — which collided with the fingerprint for the July-share check and made a right answer read as wrong:
A correct number that trips a substring fingerprint:

    "...Q4 revenue rose to $105,128..."     # contains "512"
    "...July online share reached 51.2%"    # fingerprint: look for "512"
    # → the wrong claim matches the right number
Codex's fix was pure engineering, not reasoning: it reordered the paragraph so the July share appeared before the Q4 revenue value, breaking the collision. Even when the maths is right, validators can be brittle — and an agent that reads the feedback adapts anyway.
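The collision itself is easy to reproduce. The sketch below is not the exam's validator; `naive_fingerprint_hit` is a guess at what "fingerprint by digit-substring" means, and `token_fingerprint_hit` is a hypothetical stricter variant shown only for contrast.

```python
import re


def naive_fingerprint_hit(paragraph: str, fingerprint: str) -> bool:
    """Strip every non-digit, then look for the fingerprint as a substring."""
    return fingerprint in re.sub(r"\D", "", paragraph)


def token_fingerprint_hit(paragraph: str, fingerprint: str) -> bool:
    """Compare the fingerprint against whole numbers only, never fragments."""
    tokens = re.findall(r"\d[\d,]*(?:\.\d+)?", paragraph)
    normalised = {re.sub(r"\D", "", t) for t in tokens}
    return fingerprint in normalised


corrected = "Q4 revenue rose to $105,128."
planted = "July online share reached 51.2%"

print(naive_fingerprint_hit(corrected, "512"))   # the correct figure is flagged
print(token_fingerprint_hit(corrected, "512"))   # whole-number matching is not fooled
print(token_fingerprint_hit(planted, "512"))     # the planted claim is still caught
```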
↻ Recovered by reordering the prose to dodge a fingerprint clash
Broken Chart.js HTML with a deceptive axis: truncated, dual-scaled, inverted, or log-compressed. Submit a corrected chart, a distortion value within tolerance, and the exact words describing the manipulation. Codex's explanations were right in spirit but missed the literal phrase and syntax the grader wanted.
Jaideep's was a log-scale case; the validator wanted a literal axis type, not JSON-escaped:
Jaideep's variant: the validator wants a literal `type: "linear"`.

    scales: { y: { type: "linear" } }   // not "type: \"linear\""
Anand's was an inverted y-axis that made declining satisfaction look like a rise. Codex read the accepted-phrase set in the source and dropped in the exact string:
Download a ZIP of 1,000 Python files. Exactly one calls only real APIs; the other 999 each hide one to three hallucinated calls. Find the clean one. Codex did not read 1,000 files — it built a cheap static-analysis pipeline. But its first, naive allow-list was too permissive: subtle fakes like response.code, json.ParseError, or timedelta(weeks_days=7) slipped through.
The recovery is the most elegant move in either run. Codex realised every file is a mutation of the same template, so the genuine API at each position is simply the most common one across all 1,000 variants:
Codex's positional-mode pipeline (Python, paraphrased):

    from collections import Counter

    # group the 1000 scripts by template, align mutable API positions
    position_votes = [Counter() for _ in range(num_positions)]
    for script in scripts:
        for pos, call in enumerate(api_calls(script)):
            position_votes[pos][call] += 1

    # the real API at each position is the modal (most frequent) choice
    real_api = [votes.most_common(1)[0][0] for votes in position_votes]

    # the answer = the single file that uses the real API at EVERY position
    answer = next(s for s in scripts
                  if all(c == real_api[p] for p, c in enumerate(api_calls(s))))
Jaideep's run reached the same answer for his seed (script_152.py) by evaluating the generator directly. This is exactly the scalable behaviour the question was built to reward.
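A self-contained toy version of the positional-voting idea, small enough to run by hand. The file names and the four-file dataset are made up; the fake calls are the ones quoted earlier, and the "real" counterparts are chosen for illustration. It assumes the calls have already been extracted and aligned by position, which is the hard part the real pipeline had to do.

```python
from collections import Counter


def find_clean_scripts(api_calls_by_file):
    """Vote per position, then keep files that match the winner everywhere."""
    num_positions = len(next(iter(api_calls_by_file.values())))
    votes = [Counter() for _ in range(num_positions)]
    for calls in api_calls_by_file.values():
        for pos, call in enumerate(calls):
            votes[pos][call] += 1
    # The modal call at each position is taken to be the genuine API.
    real_api = [v.most_common(1)[0][0] for v in votes]
    clean = [name for name, calls in api_calls_by_file.items() if calls == real_api]
    return real_api, clean


# Hypothetical mini-corpus: three files hide one fake call each.
toy_corpus = {
    "script_a.py": ["response.status_code", "json.ParseError", "timedelta(days=7)"],
    "script_b.py": ["response.code", "json.JSONDecodeError", "timedelta(days=7)"],
    "script_c.py": ["response.status_code", "json.JSONDecodeError", "timedelta(days=7)"],
    "script_d.py": ["response.status_code", "json.JSONDecodeError", "timedelta(weeks_days=7)"],
}

real_api, clean = find_clean_scripts(toy_corpus)
print(real_api)
print(clean)
```

The vote only works while each fake is a minority at its position, which is plausible when every file mutates a few of many positions, but it is an assumption worth checking before trusting the result.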
↻ Recovered by replacing a brittle allow-list with positional voting
Write a Hypothesis property test that fails on a seeded buggy implementation but passes on the correct one. The validator runs it in Pyodide and checks whether the bug surfaces within 1,000 generated examples. Codex didn't hope random search would stumble onto the bug; it read the seeded scenario (the danger zone was "any window containing exactly one zero") and forced the counterexample:
A property test aimed straight at the risky region:

    from hypothesis import assume, given, strategies as st

    # rolling_metric (the implementation under test) and reference (a trusted
    # oracle) come from the exam scenario; they are not defined in this snippet.
    @given(st.lists(st.integers(min_value=0, max_value=9), min_size=1))
    def test_window_handles_single_zero(xs):
        # keep only generated windows with exactly one zero
        assume(xs.count(0) == 1)
        assert rolling_metric(xs) == reference(xs)  # fails on the buggy impl
Passed cleanly in both runs. Property-based testing turns out to be a great agent task: it converts a vague "find edge cases" into executable, randomised proof.
✓ Passed both runs: no significant mistakes
The most agentic question. An external game gives an anchor node, a small query budget, and clues; you must name the compromised node and the shortest path to it, and submit a signed completion token. Codex opened the game, read its inline JavaScript, and discovered the API:
The game's hidden endpoints, found by reading the bundle:

    POST /detective/start        # begin a session, get query budget
    GET  /detective/node/{id}    # inspect one node (costs a query)
    GET  /detective/sample       # sample the graph
    POST /detective/submit       # submit node + path
    POST /detective/clear        # ← the escape hatch
It scored each node against the clues — huge outgoing volume, very few transactions, a tiny in/out ratio, few counterparties, high average transaction size. Node 91 was unambiguous:
Why node 91 was the culprit:

    node 91 → volume 24,000 · tx_count 3 · in/out 0.03 · counterparties 5 · avg_tx 9,480
Here Jaideep's run made its biggest blunder: it guessed too early — nodes 5, 49, 22 — and the game ends a session after three wrong guesses. The recovery was detective work in itself. Codex found that the failed state persisted by email and week, then probed the /detective/clear endpoint until it found the right arguments:
Recovering a dead session, one parameter at a time:

    POST /detective/clear                        # ✗ needs a week
    POST /detective/clear?week=2026-W24          # ✗ needs the session token
    POST /detective/clear?week=2026-W24 + token  # ✓ session reset, clean budget
Both runs converged on node 91 via path 13→72→91. This was the question that most exposed the gap between a chat model (which would reason vaguely) and a coding agent (which used APIs, state, graph search, scoring, and server feedback).
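The "shortest path" half of the task is a plain breadth-first search once the graph is known. The sketch below uses a made-up adjacency list: only nodes 13, 72 and 91 come from the runs, and the remaining nodes and all the edges are hypothetical. It is not the code Codex wrote.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:      # walk parent links back to the start
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parents:       # first visit is the shortest route
                parents[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical directed graph with a short and a long route to node 91.
toy_graph = {13: [40, 72], 40: [55], 55: [91], 72: [91]}

path = shortest_path(toy_graph, 13, 91)
print(path)
```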
↻ Jaideep recovered a dead game via /detective/clear · Anand one-shot the final submit
The heaviest-weighted question. Given 21 prompt fragments with per-model scores and word counts, find the shortest set of fragment IDs that clears both a macro-mean and a per-model floor across four models. Anand's run made the single cleanest mistake of the exam: it treated the model scores as plain percentages, when the validator first passes them through a sigmoid.
The formula Codex first missed, found in the validator source:

    const sigmoid = (x) => 1 / (1 + Math.exp(-x)); // raw scores are LOGITS, not %
    // first solver: averaged raw scores → validator rejected the mean
    // fixed solver: sigmoid(score) first → exhaustive search → passes
With the right transform, the exhaustive search returned a valid minimal set (Anand's seed: I8, I11, I13, I17, I20; Jaideep's: I7, I9, I10, I14, I15, I19). The lesson the analysis underlines: optimisation is where AI is strongest when it writes code — and weakest when it guesses the maths.
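A minimal sketch of "sigmoid first, then exhaustive search". Everything about the scoring rule here is an assumption made for illustration: that a set's logit per model is the sum of its fragments' logits, that "shortest" means fewest total words, and the three-fragment, two-model dataset itself. The validator's real rule is not reproduced in the text above.

```python
import math
from itertools import combinations


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def shortest_valid_set(fragments, macro_min, floor):
    """Try every subset; keep the valid one with the fewest total words."""
    best = None
    ids = sorted(fragments)
    n_models = len(fragments[ids[0]]["logits"])
    for size in range(1, len(ids) + 1):
        for combo in combinations(ids, size):
            # Assumed rule: add logits per model, THEN squash to a probability.
            probs = [sigmoid(sum(fragments[i]["logits"][m] for i in combo))
                     for m in range(n_models)]
            if min(probs) < floor or sum(probs) / n_models < macro_min:
                continue
            words = sum(fragments[i]["words"] for i in combo)
            if best is None or words < best[0]:
                best = (words, combo)
    return best


# Hypothetical fragments: word counts plus one logit per model.
toy_fragments = {
    "F1": {"words": 10, "logits": [1.0, -0.5]},
    "F2": {"words": 12, "logits": [-0.5, 1.5]},
    "F3": {"words": 30, "logits": [1.0, 1.0]},
}

best = shortest_valid_set(toy_fragments, macro_min=0.65, floor=0.6)
print(best)
```

With 21 fragments there are about two million subsets, so brute force is affordable and there is no need to guess at a cleverer method.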
↻ Recovered by reading the sigmoid in the source and re-running the search
Broken chart HTML with the wrong palette family. Explain the mismatch in an HTML comment and submit corrected HTML using the right encoding: sequential, categorical, or diverging. The validator checks perceptual properties: monotonic lightness, CIEDE2000 distances, diverging-midpoint lightness, and no rainbow.
Jaideep's variant was elevation around sea level — a textbook diverging case. Anand's was hospital wait times, where a categorical palette falsely implied each hospital was an unrelated bucket rather than a point on a fast→slow continuum. Codex's first explanation was too generic and failed; the fix required naming the exact false implication:
The explanation the grader demanded (name the false implication):

    <!-- Wait time is a continuous quantity (fast → slow), but the categorical
         palette implies each hospital is an unrelated category. Replace with a
         sequential palette so lightness increases monotonically with wait time. -->

↻ Recovered by naming the precise perceptual mismatch the rubric expected
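One of the perceptual properties, monotonic lightness, can be approximated in a few lines. This sketch uses sRGB relative luminance as a stand-in for perceptual lightness; the exam's validator may well use CIELAB L* or something else, and the two palettes are illustrative hex values, not the ones from either run.

```python
def relative_luminance(hex_colour):
    """sRGB relative luminance in [0, 1] for a colour like '#2171b5'."""
    hex_colour = hex_colour.lstrip("#")
    channels = [int(hex_colour[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma curve before weighting the channels.
    r, g, b = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
               for c in channels]
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def is_monotonic_lightness(palette):
    """True if luminance strictly rises or strictly falls along the palette."""
    lum = [relative_luminance(c) for c in palette]
    steps = [b - a for a, b in zip(lum, lum[1:])]
    return all(s > 0 for s in steps) or all(s < 0 for s in steps)


sequential = ["#f7fbff", "#c6dbef", "#6baed6", "#2171b5", "#08306b"]
categorical = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]

print(is_monotonic_lightness(sequential))
print(is_monotonic_lightness(categorical))
```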
Anand's run saved its verified result and stopped: