This repository contains a Playwright-powered scraper that finds the Academy Award for Best Director winners with the most wins, visits each director's biography page, normalizes their infobox details, and exports the enriched data to `top-directors.csv`.
This repository was generated using these three prompts on Codex CLI:

**Prompt 1**

> Use Playwright to:
> - Find the top 5 directors who got the most best-director Academy Awards by scraping https://en.wikipedia.org/wiki/Academy_Award_for_Best_Director. (Pick randomly for tie-breakers).
> - For each, go through their bio page and extract bio details for each.
> - Think of a logical structure that will capture all relevant information for them
> - Write a Python script `scrape-directors.py` that will extract this information from any director's page
> - Run it for the top 5 directors' pages and save the results in a top-directors.csv
>
> Write unit tests for this.

**Prompt 2**

> Generate a README.md documenting
> - What this repo does and how to use it
> - The prompt used to generate it
> - How you (Codex CLI) used playwright to generate the script (use your conversation history + thinking for this)
> - How this process can be extended to generate scrapers for any content from any site
> - What worked well and what didn't - and therefore, what best practices to follow.

**Prompt 3**

> Expand the `## How Codex CLI Built This with Playwright` section into much more detail, explaining step-by-step how Codex worked to solve this problem. Look at the logs and cite verbatim from conversation history.
```bash
UV_CACHE_DIR=.uv-cache uv run playwright install firefox
UV_CACHE_DIR=.uv-cache uv run python scrape-directors.py --limit 5 --seed 0 --output top-directors.csv
UV_CACHE_DIR=.uv-cache uv run --extra dev pytest
```
The resulting `top-directors.csv` contains cleaned infobox fields plus a JSON snapshot of the full infobox per director. All logic lives in `scrape_directors.py`; `scrape-directors.py` is the Typer entrypoint.
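The "cleaned fields plus JSON snapshot" layout can be sketched with the stdlib alone. This is illustrative only: the column names and the `infobox_json` field below are assumptions, not the script's actual schema.

```python
import csv
import json
import tempfile
from pathlib import Path

def write_directors_csv(records: list, path: Path) -> None:
    """Write cleaned fields plus a JSON snapshot of the full infobox per row."""
    fieldnames = ["name", "wins", "born", "infobox_json"]  # hypothetical columns
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        for rec in records:
            row = {k: rec.get(k, "") for k in fieldnames[:-1]}
            # The whole infobox dict rides along as one JSON-encoded column.
            row["infobox_json"] = json.dumps(rec.get("infobox", {}), ensure_ascii=False)
            writer.writerow(row)

out = Path(tempfile.mkdtemp()) / "top-directors-demo.csv"
write_directors_csv(
    [{"name": "John Ford", "wins": 4, "born": "February 1, 1894",
      "infobox": {"Born": "February 1, 1894", "Occupation": "Film director"}}],
    out,
)
rows = list(csv.DictReader(out.open(encoding="utf-8", newline="")))
```

Keeping the raw infobox as a JSON column means downstream consumers can recover fields that were not promoted to their own column.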
```bash
UV_CACHE_DIR=.uv-cache uv run playwright install firefox
```
When a network block interrupted dependency resolution, the CLI surfaced the exact error before retrying with approval:

```
error: Failed to fetch: https://pypi.org/simple/pytest/
```
```python
page.goto('https://en.wikipedia.org/wiki/Academy_Award_for_Best_Director', wait_until='domcontentloaded')
Path('fixtures').mkdir(exist_ok=True)
Path('fixtures/best_director.html').write_text(html, encoding='utf-8')
```
Codex repeated the pattern for the director biographies so it could compare infobox layouts without repeated network calls:
```python
page.goto(url, wait_until='domcontentloaded')
Path(f'fixtures/{name}.html').write_text(page.content(), encoding='utf-8')
```
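The caching pattern generalizes to a small helper. This sketch abstracts the Playwright call behind a `fetch` callable (the helper name and signature are mine, not the repo's), so parsing code can iterate on saved fixtures without touching the network.

```python
import tempfile
from pathlib import Path
from typing import Callable

def cached_fixture(name: str, fetch: Callable[[], str], fixtures: Path) -> str:
    """Return cached HTML for `name`, calling `fetch` only on a cache miss.

    `fetch` stands in for a Playwright call such as
    page.goto(url); page.content().
    """
    fixtures.mkdir(exist_ok=True)
    path = fixtures / f"{name}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch()
    path.write_text(html, encoding="utf-8")
    return html

calls = []
def fake_fetch() -> str:
    calls.append(1)
    return "<html>John Ford</html>"

with tempfile.TemporaryDirectory() as tmp:
    fixtures = Path(tmp)
    first = cached_fixture("john_ford_demo", fake_fetch, fixtures)
    second = cached_fixture("john_ford_demo", fake_fetch, fixtures)  # served from disk
```

The second call never hits `fetch`, which is exactly what makes selector iteration fast and polite to the target site.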
Codex then parsed the cached fixtures with `lxml` to confirm how win counts propagate down the table's `rowspan` cells and to derive deterministic tie-handling:
```python
table_nodes = doc.xpath("//h3[@id='Multiple_wins']/parent::div/following-sibling::table[1]")
result = [('John Ford', 4), ('Frank Capra', 3), ('William Wyler', 3), ('Steven Spielberg', 2), ('George Stevens', 2)]
```
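The `rowspan` propagation that step verified can be illustrated with a small pure-Python sketch. This is a simplified model of the table, not the script's actual parser.

```python
def fill_rowspans(rows):
    """Expand rowspan'd cells so every output row carries its own copy.

    Each cell is a (text, rowspan) pair. This mirrors how a shared win
    count in the 'Multiple wins' table applies to every director row it
    spans. Simplified: assumes spans start in the leftmost columns.
    """
    carry = {}  # column index -> (text, rows still covered)
    expanded = []
    for row in rows:
        cells = iter(row)
        out, col = [], 0
        while True:
            if col in carry:                       # a cell from above still covers this slot
                text, left = carry[col]
                out.append(text)
                if left - 1 == 0:
                    del carry[col]
                else:
                    carry[col] = (text, left - 1)
                col += 1
                continue
            cell = next(cells, None)
            if cell is None:
                break
            text, span = cell
            out.append(text)
            if span > 1:                           # remember it for the next span-1 rows
                carry[col] = (text, span - 1)
            col += 1
        expanded.append(out)
    return expanded

table = [
    [("4", 1), ("John Ford", 1)],
    [("3", 2), ("Frank Capra", 1)],    # the "3" also covers the next row
    [("William Wyler", 1)],
    [("2", 2), ("Steven Spielberg", 1)],
    [("George Stevens", 1)],
]
expanded = fill_rowspans(table)
```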
Replacing `remove()` with `drop_tree()` preserved trailing text (e.g., birth dates) while still stripping hidden spans.
The cleaned cell:

```html
<td class="infobox-data">
  February 1, 1894
  <div style="display:inline" class="birthplace">Cape Elizabeth, Maine, U.S.</div>
</td>
```

yields the text fragments:

```python
['February 1, 1894', 'Cape Elizabeth, Maine', ', U.S.']
```
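The tail-preserving behavior of lxml's `drop_tree()` can be mimicked in the stdlib for illustration. `xml.etree` has no `drop_tree`, so the helper below is an analogue I am calling `drop_tree_like`, not a library call.

```python
import xml.etree.ElementTree as ET

def drop_tree_like(parent: ET.Element, child: ET.Element) -> None:
    """Remove `child` but keep its tail text, as lxml's drop_tree() does.

    A plain remove() discards the tail too, which is how trailing text
    such as a ', U.S.' suffix after a hidden span would be lost.
    """
    tail = child.tail or ""
    kids = list(parent)
    idx = kids.index(child)
    if idx == 0:
        parent.text = (parent.text or "") + tail   # attach tail to parent text
    else:
        prev = kids[idx - 1]
        prev.tail = (prev.tail or "") + tail       # attach tail to prior sibling
    parent.remove(child)

td = ET.fromstring(
    '<td>February 1, 1894'
    '<span style="display:none">hidden</span>, U.S.</td>'
)
drop_tree_like(td, td.find("span"))
text = "".join(td.itertext())
```

With a bare `parent.remove(child)` the `, U.S.` tail would vanish along with the hidden span.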
```python
entries = parse_top_directors(html, top_n=5, seed=0)
assert names == [
    "John Ford",
    "Frank Capra",
    "William Wyler",
    "Steven Spielberg",
    "George Stevens",
]
```
```python
records = _scrape_top_directors(limit, seed)
typer.run(main)
```
```bash
UV_CACHE_DIR=.uv-cache uv run python scrape-directors.py --limit 5 --seed 0 --output top-directors.csv
```
These steps combined Playwright's navigation with `lxml` parsing, fixture-driven iteration, and seeded randomness to deliver a reproducible scraper.
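The seeded tie-handling can be sketched as shuffle-then-stable-sort. This is one plausible scheme; the script's exact tie-breaking logic may differ.

```python
import random

def top_directors(counts, top_n, seed):
    """Rank by wins (descending); ties land in a seeded-random but
    reproducible order thanks to shuffle followed by a stable sort."""
    rng = random.Random(seed)
    names = list(counts)
    rng.shuffle(names)                    # random order settles ties...
    names.sort(key=lambda n: -counts[n])  # ...stable sort keeps it among equal counts
    return names[:top_n]

wins = {"John Ford": 4, "Frank Capra": 3, "William Wyler": 3,
        "Steven Spielberg": 2, "George Stevens": 2, "Clint Eastwood": 2}
picked = top_directors(wins, top_n=5, seed=0)
```

Because Python's `sort` is stable, the shuffled order survives only within groups of equal win counts, so the same seed always yields the same tie-break.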
- Cache pages as fixtures first, then iterate on `lxml`/CSS selectors until the data model is stable.
- Quick `uv run python - <<'PY'` snippets exposed parsing bugs (e.g., `drop_tree` vs. `remove`) before they reached the tests.
- `getparent().remove()` dropped trailing text (lost birth dates). Switching to `drop_tree()` preserved tails; prefer it when cleaning HTML trees.
- `uv run scrape-directors.py` failed; the working invocation is `uv run python scrape-directors.py`. Always confirm CLI ergonomics.

Following these practices makes it straightforward to adapt the approach to scrape other structured or semi-structured web data with Playwright and Python.