On the fifth of March, 1899, a group of English-speaking students in Zürich arrived at the Allmend — a military commons they used as a football pitch — and found an official notice pinned to the gate. The military directorate of the canton had banned all play on the training ground until further notice. It was too late to reach anyone. The age of mobile phones was still a century away. The match against FC Basel went ahead anyway, with only seven men on the Zürich side.
Players trickled in as the game progressed. The referee, scheduled for the afternoon, never appeared — no one had told him the kickoff had been brought forward to the morning. The crowd, according to a contemporary account, consisted of "approximately 10 to 20 spectators." By the report's own judgment: under such circumstances, such an important match should not have been played.
The Anglo-American Football Club of Zürich, a team of chemistry students at the Federal Polytechnic School (the institution that would become ETH Zürich), won 10 to nothing. Their center forward, Robert Collinson, scored eight of those goals himself. The club went on to win Switzerland's first-ever official national football championship that season. They had also, four years earlier, been present at the founding of the Swiss Football Association itself.
That story, all 316 words of it, lives in the Wikipedia article for Rudolf Schwarz. And in the article for Ernst Gass. And Hans Billeter, and Georges Fürstenberger, and R. Sommer, and Ernst-Alfred Thalmann, and Adolf Rittmann, and Emanuel Schiess. Eleven players in all, eleven different Wikipedia pages, one word-for-word identical account of a single winter morning 127 years ago.
"Eleven articles sharing 316 words about a match played before a crowd of 10 to 20 spectators."
This seemed extraordinary — until we found the paragraph that appears not eleven times, but 418.
Every time a new range of asteroids gets officially named, a Wikipedia editor creates an article. The article might be called "Meanings of minor-planet names: 269001–270000" or "Meanings of minor-planet names: 380001–381000." There are 418 such pages, tracking the names of numbered asteroids from the tens of thousands up past 380,000. And at the top of every single one sits the same 213-word paragraph. It explains how the International Astronomical Union's Minor Planet Center assigns permanent numbers to space rocks. It cites the JPL Small-Body Database and the IAU Working Group for Small Bodies Nomenclature. It mentions a German astronomer named Lutz D. Schmadel — born 1942 in Berlin, died October 21, 2016, Heidelberg — who spent his career at the Astronomisches Rechen-Institut compiling every naming citation into his Dictionary of Minor Planet Names, edition after edition, until the end of his life.
The same 213 words. In 418 separate articles. This is what the data turns up when you ask: what is the longest passage in all of English Wikipedia that appears in the most articles?
A 35-Gigabyte Mirror of Everything
The Wikimedia Foundation publishes all of English Wikipedia as structured data on Hugging Face — 86 Parquet files, a columnar format built for large-scale queries. Downloaded, they occupy 34.61 gigabytes. Inside: 7.6 million rows, one per article, with structured fields for the abstract, sections, references, infoboxes, tables, and editor metadata.
The sections field — a JSON blob containing the full article body — is where the text lives. The median article runs about 5,000 characters. The shortest are stubs of three characters; the longest, a list of government schools in New South Wales, approaches a million. Seventeen percent of the most recent edits were made by bots.
To find what echoes most, we extracted 190 million text fragments — every paragraph-length and sentence-length piece of text — and hashed each one. For each hash, we counted how many distinct articles it appeared in. Then we ranked by a score combining length and frequency: words × log₂(articles + 1). Long text in many articles scores highest. Short text even in thousands of articles scores less. The top 30 results are what you see below.
The Diversity Spectrum
The 30 most-repeated long passages range from pure boilerplate (an 81-word census note copied into 2,920 Slovak village stubs) to something stranger: 316 words about an 1899 football match, shared across eleven individual player articles. Each circle is one passage. Horizontal position shows article diversity (how different the carrier articles are from each other). Vertical position is the interestingness score. Circle size scales with article count.
Five Stories Inside the Data
Most of what the analysis surfaces is expected: Wikipedia has hundreds of nearly-identical pages about Spanish Senate constituencies, Ohio townships, and NFL playoff seasons, and they share boilerplate explanations of their respective systems. The interesting exceptions are the passages that shouldn't repeat — that appear in contexts so different you'd never predict they'd share a sentence.
#1 by score
The Asteroid Preamble
The highest-scoring repeated passage is also the most predictable: the standard introductory paragraph for every "Meanings of minor-planet names" article. Wikipedia has 418 such pages, one for each thousand-asteroid range from the low thousands to past 380,000. Every one opens with the same 213-word explanation of how the International Astronomical Union assigns permanent numbers, how discoverers can then propose names, and how Lutz D. Schmadel compiled all the naming citations into his dictionary at the University of Heidelberg's Astronomisches Rechen-Institut until his death in October 2016.
The paragraph is doing necessary work: it contextualizes what these articles are. But its 418-way repetition also points to the scale of solar system cataloguing. The Minor Planet Center has now confirmed more than 600,000 numbered objects. The naming articles cover only a fraction — but already generate 418 near-identical introductions, and the count grows with every new discovery batch.
#12 · most diverse
The Zurich Game
The most surprising repeated passage is also one of the longest: a 316-word account of the Anglo-American Football Club's 1898–99 season — banned practice ground, missing referee, seven-man start, and the 10–0 rout that announced Switzerland's future champions to anyone who cared to watch. The club, formed by chemistry students at the Federal Polytechnic School, had attended the founding meeting of the Swiss Football Association in 1895 and won its first national title in 1899.
This passage doesn't appear in an article about the club. It appears in the Wikipedia pages for eleven individual players connected to that season — their articles otherwise distinct in career paths, biographical detail, and club affiliation. They share, word for word, this one episode of collective memory: a match played before 10 to 20 spectators that became the founding myth of Swiss football.
What makes this the most interesting entry isn't its score (modest, at 1,133) but its position on the diversity axis — the furthest right point on the chart. Eleven separate biographical subjects sharing a single narrative passage. It's the closest Wikipedia comes to a campfire story.
#25 · widest reach
The Slovak Census Note
The passage with the single largest article count is not the most interesting text we found. It is a methodological footnote: an 81-word note explaining why the population figure at the top of a Slovak municipality article may differ from the census figure in the table below. The reason: a student may be officially registered in their home village but spend most of their time studying in the city.
This note appears in 2,920 Wikipedia articles about Slovak villages and municipalities — roughly one for every settlement in the country. The entire Slovak municipal presence on Wikipedia is connected by a single paragraph about the limitations of permanent residence data. It is, in a way, the purest expression of the phenomenon: a sentence of bureaucratic clarification, copied faithfully across thousands of nearly identical stubs, doing invisible consistency work at national scale.
#2 · high score + high diversity
The Proteasome Shadow
The second-highest-scoring passage is the most surprising outlier in the chart: a 307-word explanation of the ubiquitin-proteasome system (UPS) — the cellular mechanism that disposes of damaged or misfolded proteins — appearing across 26 different gene articles. PSMA3, PSMB1, PSMA5, PSMD12, PSMB3, PSMC1: each is a different molecular component of the proteasome, with different functions and different disease associations. Their Wikipedia articles are genuinely distinct. But they all share nearly 2,250 characters explaining UPS's role in cancer, neurodegeneration, Alzheimer's, Parkinson's, ALS, Huntington's disease, and inflammatory response.
This is science boilerplate at its most pervasive, and also its most understandable. Every writer who touches a proteasome subunit article reaches for this paragraph because it efficiently establishes why the UPS matters. The problem: a factual update to any of those 307 words now has to be made 26 times, in 26 separate places. Otherwise the copies quietly diverge until no one is sure which version is current.
#23
The Andor Backstory
The repetition problem is not confined to academic or governmental content. A 281-word history of how the Disney+ series Andor came to be appears in ten separate Wikipedia articles, each covering a single episode of the show's first season. It runs from Bob Iger's February 2018 announcement through Tony Gilroy replacing Stephen Schiff as showrunner, to the COVID-19 delay that forced Gilroy to manage production remotely from New York.
Each episode article covers a different story, a different setting, a different moment in the narrative. But all ten share the identical origin paragraph in their "Background" sections. It's a reminder that Wikipedia's serialized episode-page structure naturally generates this pattern: the same contextual paragraph, pasted before each new chapter of any long-running work.
The Residue of Knowledge
Think of Wikipedia as a coral reef. From above, it looks like 7.6 million individual organisms — each article its own shape and color, its own story. But underneath, they share a skeleton. The repeated passages we found are that skeleton: paragraphs doing the connective work that keeps the encyclopedia coherent, establishing context every time a new article touches a common system.
Most of that skeleton is dull infrastructure. The same note about Slovak census methodology in 2,920 village pages. The same township governance paragraph in 1,197 Ohio articles. The same explanation of the D'Hondt method in 110 Spanish constituency articles. The same Judiciary Act of 1789 text in 309 Supreme Court case lists. This is how encyclopedias work — and how they must work, to be consistent at scale.
But some of it is more interesting. The hill fort passage, shared across 33 different ancient British earthworks: each a different place, each containing archaeologist Barry Cunliffe's argument about Iron Age social stress and shifting trade routes. The superheavy element passage, explaining the island of stability theory, appearing in 25 element articles from nihonium to lawrencium. And the 1898–99 football story, living in eleven separate biographical pages, the most diverse set of carrier articles in the entire top 30.
These aren't just boilerplate. They're moments where editors writing about genuinely different things reached for the same explanation, because it was the right one. The same 224 words about why hill forts were built fits every hill fort article because that's the state of the archaeological evidence. It belongs to all of them equally — and to none of them exclusively.
"Strip away every repeated passage and what remains is the genuinely unique part of Wikipedia — the residue that no other article touches."
What would Wikipedia look like if you subtracted all duplicated text? You'd lose the minor planet preamble from 418 articles. The census note from 2,920 Slovak pages. The Andor origin story from ten episode articles. The proteasome paragraph from 26 gene pages. You'd be left with the actual unique contribution each article makes — the part written for exactly one context and no other.
That residue — the non-repeated fraction — might be the most accurate measure of how much Wikipedia actually knows. Not 82 billion characters. Something considerably smaller, and considerably more specific.
As for the football match: a group of chemistry students played in Zürich in the winter of 1899, before an audience of perhaps twenty people, under circumstances that argued against the game being played at all. They won anyway. A century and a quarter later, their story appears, word for word, in eleven separate corners of the largest reference work humanity has ever produced. The Allmend may have been frozen. The crowd may have been sparse. But that morning, it turns out, was not going to stay small.
Behind the Analysis: How an AI Agent Worked Through 35 Gigabytes
Finding the most-repeated long passages in 7.6 million Wikipedia articles is not a job for a weekend script. Comparing every article to every other would require 28.8 trillion pairs — at typical computing speeds, roughly 23,000 years of work. This analysis was done differently: by handing the problem to an AI coding agent, watching it work, and redirecting it once.
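The pair count follows from the handshake formula n(n − 1)/2. Here is a quick sketch of that arithmetic, using the exact article count of 7,597,149 quoted in the pipeline notes further down. The pairs-per-second figure is not a measurement from the session; it is only the rate implied by the 23,000-year estimate.

```python
# Number of unordered article pairs: n * (n - 1) / 2
N_ARTICLES = 7_597_149                 # article count quoted in the pipeline notes
SECONDS_PER_YEAR = 365.25 * 24 * 3600

pairs = N_ARTICLES * (N_ARTICLES - 1) // 2

# Assumption: back out the comparison rate implied by "roughly 23,000 years".
implied_pairs_per_second = pairs / (23_000 * SECONDS_PER_YEAR)

print(f"{pairs:,} pairs")              # about 28.86 trillion
print(f"{implied_pairs_per_second:.0f} pairs per second implied")  # about 40
```

At a few dozen full-text comparisons per second, the quadratic growth in pairs, not the speed of any single comparison, is what makes the brute-force route impractical.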
The tool was Codex, OpenAI's terminal-based coding agent (similar in spirit to Claude Code), running the GPT-5.5 model in fully autonomous "yolo" mode: it could write code, execute it, read the output, and iterate without pausing to ask for permission at any step. The full session ran from 09:15 to 11:35 UTC on 26 May 2026: about two hours and twenty minutes, from the first prompt to verified results.
… ProcessPoolExecutor and DuckDB.
… words × log₂(articles + 1), rewarding text that is both long and widespread rather than gaming either dimension. The top 500 were saved; the top 30 written up in lcs.md with verified article counts.
Why Not Just Compare Articles Directly?
The most natural tool for finding shared text between two documents is pairwise comparison: take article A, compare it character-by-character against article B, find the longest common run, repeat. A Python library called pylcs does exactly this in compiled C++ code, using dynamic programming — the same technique used in DNA sequence alignment and diff tools. It's fast for two documents. It does not scale to 7.6 million of them.
The agent benchmarked four approaches on a sample of 2,000 real downloaded articles.
What Is Fingerprinting?
A fingerprint (or hash) is a short, fixed-length number computed from a piece of text. The rule is simple: identical inputs always produce identical outputs, and any change — even one character — produces a completely different output. It's like a library catalogue number that is uniquely and deterministically computed from the book's contents.
a3f8c2d1b0e74591
a3f8c2d1b0e74591 ✓ same
7b2e9f4c1a830d56 ✗ different
The algorithm used is BLAKE2b, a cryptographic hash function designed for speed. The same algorithm is used to verify file downloads and sign software packages. Here's the actual code:
import hashlib

def fingerprint(text: str) -> int:
    # BLAKE2b with 8-byte output = a 64-bit integer fingerprint
    digest = hashlib.blake2b(text.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")

# Two identical sentences in different articles → same fingerprint
text_a = "Hill forts developed in the Late Bronze Age..."
text_b = "Hill forts developed in the Late Bronze Age..."  # different article
fingerprint(text_a) == fingerprint(text_b)  # True ← match found!

# One word different → completely different fingerprint
text_c = "Hill forts developed in the Early Bronze Age..."
fingerprint(text_a) == fingerprint(text_c)  # False
The key advantage: instead of comparing 7.6 million articles pairwise (28.8 trillion operations), you compute one fingerprint per fragment per article, roughly 190 million fingerprints in total, and then let a database count which fingerprints appear in multiple articles. Finding duplicates in a list of 190 million numbers takes seconds. Finding shared substrings across 28.8 trillion article pairs takes millennia.
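The counting step can be illustrated in miniature without a database. The sketch below is an assumption-laden toy, not the session's code: the three-article `corpus` and the helper name `shared_fragments` are hypothetical, and a plain dictionary stands in for DuckDB.

```python
import hashlib
from collections import defaultdict

def fingerprint(text: str) -> int:
    """64-bit BLAKE2b fingerprint, as in the snippet above."""
    digest = hashlib.blake2b(text.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")

# Hypothetical toy corpus: article title -> list of text fragments.
corpus = {
    "Hill fort A": ["Hill forts developed in the Late Bronze Age...",
                    "Text unique to fort A."],
    "Hill fort B": ["Hill forts developed in the Late Bronze Age...",
                    "Text unique to fort B."],
    "Unrelated article": ["Hill forts developed in the Early Bronze Age..."],
}

def shared_fragments(corpus, min_articles=2):
    """Map each fingerprint to the set of articles that carry it,
    keeping only fingerprints found in at least min_articles articles."""
    carriers = defaultdict(set)
    for title, fragments in corpus.items():
        for fragment in fragments:
            carriers[fingerprint(fragment)].add(title)
    return {h: titles for h, titles in carriers.items()
            if len(titles) >= min_articles}

shared = shared_fragments(corpus)
for titles in shared.values():
    print(sorted(titles))
```

Each fragment is touched once, so the work grows with the number of fragments rather than with the number of article pairs. The cost is that only exact matches are found: the "Early Bronze Age" variant gets a different fingerprint and is never linked to the other two.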
The fragment extraction uses regular expressions to split articles on sentence boundaries, keeping only long fragments (at least 70 characters, at least 10 words) that are worth tracking:
import re

# Split on sentence-ending punctuation followed by a capital letter or digit
SENTENCE_RE = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9])')

def extract_fragments(article_text: str):
    """Yield (fingerprint, sentence) for every long sentence in an article."""
    for sentence in SENTENCE_RE.split(article_text):
        sentence = sentence.strip()
        if len(sentence) < 70:  # too short to be interesting
            continue
        if len(sentence.split()) < 10:  # fewer than 10 words
            continue
        # fingerprint() is the BLAKE2b helper defined in the previous snippet
        yield fingerprint(sentence), sentence

# For each of 7,597,149 articles across 86 Parquet shards,
# this produces ~25 fragments on average → ~190 million total.
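One behaviour of this splitting rule is worth seeing on a concrete input. The short sketch below reuses the same regular expression on a made-up two-sentence sample (the sample text is an illustration, not a quotation from any article) and shows that an initial such as "D." is treated as a sentence end.

```python
import re

# Same pattern as in the extraction code above.
SENTENCE_RE = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9])')

# Hypothetical sample text containing a middle initial.
sample = ("The dictionary was compiled by Lutz D. Schmadel in Heidelberg. "
          "It runs to many editions.")

pieces = SENTENCE_RE.split(sample)
for piece in pieces:
    print(repr(piece))
# 'The dictionary was compiled by Lutz D.'
# 'Schmadel in Heidelberg.'
# 'It runs to many editions.'
```

Pieces this short fall under the 70-character and 10-word thresholds and are dropped, so sentences containing initials or abbreviations may be under-counted at sentence level. Because the same rule is applied to every article, identical text is still cut in identical places and still matches.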
How DuckDB Finds the Matches
Once all 190 million fingerprints are written to intermediate Parquet files (one file per shard, 6 workers in parallel), DuckDB takes over. DuckDB is an in-process analytical database — think of it as a spreadsheet engine that can read billions of rows from files without loading them all into memory at once. A single SQL query aggregates all 190 million fingerprint records and surfaces only those that appear in ten or more distinct articles:
-- Read all 190 million fingerprint records from the fragment files
SELECT
    hash,
    COUNT(DISTINCT article_id) AS articles,  -- distinct articles containing it
    SUM(occurrences) AS total_hits,
    MAX(word_count) AS words,
    -- interestingness formula: words × log₂(articles + 1)
    MAX(word_count) * LOG2(COUNT(DISTINCT article_id) + 1) AS score
FROM read_parquet('analysis/lcs_fragments/*.parquet')
GROUP BY hash
HAVING COUNT(DISTINCT article_id) >= 10      -- appears in at least 10 articles
ORDER BY score DESC
LIMIT 500;                                   -- keep top 500 candidates
This runs in seconds. DuckDB reads only the columns it needs (hash, article_id, occurrences, word_count) and aggregates on the fly, never loading all 190 million rows into memory simultaneously. A second pass over the original Parquet shards then retrieves the actual text and example article titles for the top 500 hashes, since the intermediate files store only fingerprints, not the original sentences.
The two-pass design is deliberate. Pass 1 is CPU-intensive (hashing) and easily parallelised across 6 workers; Pass 2 is I/O-intensive (reading large Parquet files) and benefits from DuckDB's columnar reader. Together they process 34.6 gigabytes in about 90 minutes. That works out to roughly 6.4 megabytes per second, far below what a modern SSD can read, which suggests the run is limited by parsing and hashing rather than by the disk.
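The shape of the two passes can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the session's script: shards are modelled as in-memory lists of `(article_id, fragment)` tuples instead of Parquet files, a dictionary replaces the DuckDB aggregation, and the function names `hash_shard`, `run_pass_one`, `top_hashes` and `run_pass_two` are hypothetical.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

def fingerprint(text: str) -> int:
    digest = hashlib.blake2b(text.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")

def hash_shard(shard):
    """Pass 1 worker: reduce one shard to (fingerprint, article_id) records.
    The text itself is discarded to keep the intermediate data small."""
    return [(fingerprint(fragment), article_id) for article_id, fragment in shard]

def run_pass_one(shards, workers=6):
    """Fan the shards out across worker processes (6, as in the text)."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return [rec for records in pool.map(hash_shard, shards) for rec in records]

def top_hashes(records, min_articles=2):
    """Aggregation step: keep fingerprints seen in enough distinct articles."""
    carriers = defaultdict(set)
    for h, article_id in records:
        carriers[h].add(article_id)
    return {h for h, ids in carriers.items() if len(ids) >= min_articles}

def run_pass_two(shards, wanted):
    """Pass 2: re-read the shards to recover text for the surviving hashes."""
    texts = {}
    for shard in shards:
        for _, fragment in shard:
            h = fingerprint(fragment)
            if h in wanted:
                texts.setdefault(h, fragment)
    return texts
```

When run as a script, `run_pass_one` should be called under an `if __name__ == "__main__":` guard, since worker processes may re-import the module. The second pass repeats the hashing, but only has to keep text for the few hundred surviving fingerprints rather than for all 190 million.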
What Makes a Passage "Interesting"?
The challenge with ranking repeated passages is that raw frequency and raw length both fail on their own. A 2-word phrase like "see also" appears in nearly every Wikipedia article — but it's useless. A 500-word passage appearing in exactly two articles may be impressive but hard to call a pattern. The score needs to reward both dimensions simultaneously.
The formula chosen was score = words × log₂(articles + 1). The logarithm dampens the article count: going from 10 to 100 articles nearly doubles the multiplier (from about 3.5 to 6.7), while going from 1,000 to 10,000 raises it by only a third (from about 10.0 to 13.3). The word count multiplier, meanwhile, ensures that longer passages score much higher than shorter ones with the same reach.
import math

def score(words: int, articles: int) -> float:
    """Higher is more interesting. Rewards length AND frequency."""
    return words * math.log2(articles + 1)

# Rank #1: the asteroid preamble (213 words, 418 articles):
score(213, 418)        # → 1855.4 ← top of the list
# A 100-word passage in 100 articles:
score(100, 100)        # → 665.8 ← much less interesting
# A 2-word phrase in 1 million articles:
score(2, 1_000_000)    # → 39.9 ← nearly zero, too short to matter
# A 500-word passage in only 2 articles:
score(500, 2)          # → 792.5 ← decent score but not top-30 without more reach
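As a cross-check, the same formula can be applied to the five passages profiled earlier, using the word and article counts given in their write-ups. The sketch below assumes those counts are exactly what the ranking used; the short labels are just names for this example.

```python
import math

def score(words: int, articles: int) -> float:
    return words * math.log2(articles + 1)

# (words, articles) as reported in the five stories above.
passages = {
    "Asteroid preamble": (213, 418),
    "Proteasome paragraph": (307, 26),
    "Zurich match": (316, 11),
    "Andor backstory": (281, 10),
    "Slovak census note": (81, 2920),
}

ranked = sorted(passages, key=lambda name: score(*passages[name]), reverse=True)
for name in ranked:
    words, articles = passages[name]
    print(f"{score(words, articles):7.1f}  {name}")
```

The resulting order (asteroids, proteasome, Zurich, Andor, Slovak census) agrees with the reported ranks of #1, #2, #12, #23 and #25, and the Zurich passage comes out at about 1,133, the score quoted for it.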
"The AI didn't just run the analysis — it chose the right method. The benchmark detour was the agent's own interpretation of what 'done' means: verify the approach before committing to an hours-long run."
What AI Coding Agents Make Possible
Four Python scripts were written, linted, and executed over the course of this session: one to download the dataset, one to profile its schema, one to benchmark the three candidate algorithms, and one to run the full two-pass mining pipeline. The agent also used uv to manage Python environments, pyarrow to read Parquet files, pylcs for the benchmark comparison, tqdm for progress bars, and DuckDB for SQL aggregation. All autonomously installed, called, and verified.
A skilled data engineer could produce the same pipeline in a day or two of focused work. The AI agent produced it in 140 minutes — including the benchmarking detour that a human might skip to save time. One person with one tool and one afternoon.
The Wikipedia corpus is not unusual in its scale. Government records, court filings, scientific literature, and financial disclosures all exist at similar sizes. What this session demonstrated is that the barrier between "I wonder if this pattern exists in this data" and "here are the top 30 verified examples with article counts and diversity scores" has dropped to a single conversation. The ideas, the judgment, the interpretation: those still belong to the human. The four scripts, the three benchmarks, the six parallel workers, and the 190 million fingerprints: those belong to the machine.
The 213-word asteroid preamble — appearing word for word across 418 Wikipedia articles — was always there. It took a two-hour autonomous sweep through 190 million text fragments to find it.