Wikipedia · Data Analysis · May 2026

The Paragraph That Appears 418 Times

We downloaded all of English Wikipedia, 7.6 million articles in 35 gigabytes of compressed Parquet files, and searched for the longest text that echoes through the most pages. What we found reveals the hidden skeleton of human knowledge.


On the fifth of March, 1899, a group of English-speaking students in Zürich arrived at the Allmend — a military commons they used as a football pitch — and found an official notice pinned to the gate. The military directorate of the canton had banned all play on the training ground until further notice. It was too late to reach anyone. The age of mobile phones was still a century away. The match against FC Basel went ahead anyway, with only seven men on the Zürich side.

Players trickled in as the game progressed. The referee, scheduled for the afternoon, never appeared — no one had told him the kickoff had been brought forward to the morning. The crowd, according to a contemporary account, consisted of "approximately 10 to 20 spectators." By the report's own judgment: under such circumstances, such an important match should not have been played.

The Anglo-American Football Club of Zürich, made up of chemistry students at the Federal Polytechnic School (the institution that would become ETH Zürich), won 10 to nothing. Their center forward, Robert Collinson, scored eight of those goals himself. The club went on to win Switzerland's first-ever official national football championship that season. They had also, four years earlier, been present at the founding of the Swiss Football Association itself.

That story, all 316 words of it, lives in the Wikipedia article for Rudolf Schwarz. And in the article for Ernst Gass. And Hans Billeter, and Georges Fürstenberger, and R. Sommer, and Ernst-Alfred Thalmann, and Adolf Rittmann, and Emanuel Schiess. Eleven different players, eleven different Wikipedia pages, one word-for-word identical account of a single winter morning 127 years ago.

"Eleven articles sharing 316 words about a match played before a crowd of 10 to 20 spectators."

This seemed extraordinary — until we found the paragraph that appears not eleven times, but 418.

Every time a new range of asteroids gets officially named, a Wikipedia editor creates an article. The article might be called "Meanings of minor-planet names: 269001–270000" or "Meanings of minor-planet names: 380001–381000." There are 418 such pages, tracking the names of numbered asteroids from the tens of thousands up past 380,000. And at the top of every single one sits the same 213-word paragraph. It explains how the International Astronomical Union's Minor Planet Center assigns permanent numbers to space rocks. It cites the JPL Small-Body Database and the IAU Working Group for Small Bodies Nomenclature. It mentions a German astronomer named Lutz D. Schmadel — born 1942 in Berlin, died October 21, 2016, Heidelberg — who spent his career at the Astronomisches Rechen-Institut compiling every naming citation into his Dictionary of Minor Planet Names, edition after edition, until the end of his life.

The same 213 words. In 418 separate articles. This is what the data turns up when you ask: what is the longest passage in all of English Wikipedia that appears in the most articles?

A 35-Gigabyte Mirror of Everything

The Wikimedia Foundation publishes all of English Wikipedia as structured data on Hugging Face — 86 Parquet files, a columnar format built for large-scale queries. Downloaded, they occupy 34.61 gigabytes. Inside: 7.6 million rows, one per article, with structured fields for the abstract, sections, references, infoboxes, tables, and editor metadata.

The sections field — a JSON blob containing the full article body — is where the text lives. The median article runs about 5,000 characters. The shortest are stubs of three characters; the longest, a list of government schools in New South Wales, approaches a million. Seventeen percent of the most recent edits were made by bots.

To find what echoes most, we extracted 190 million text fragments — every paragraph-length and sentence-length piece of text — and hashed each one. For each hash, we counted how many distinct articles it appeared in. Then we ranked by a score combining length and frequency: words × log₂(articles + 1). Long text in many articles scores highest. Short text even in thousands of articles scores less. The top 30 results are what you see below.

7.6M · Articles
35 GB · Compressed
190M · Fragments scanned
418× · Most repeated passage

The Diversity Spectrum

The 30 most-repeated long passages range from pure boilerplate (an 81-word census note copied into 2,920 Slovak village stubs) to something stranger: 316 words about an 1899 football match, shared across eleven individual player articles. In the chart, each circle is one passage. Horizontal position shows article diversity (how different the carrier articles are from each other). Vertical position is the interestingness score. Circle size scales with article count.

Template — identical stubs
Series — numbered or alphabetical lists
Geographic — places in a shared region
Thematic — different subjects, shared science
Biographical — different individuals

Five Stories Inside the Data

Most of what the analysis surfaces is expected: Wikipedia has hundreds of nearly-identical pages about Spanish Senate constituencies, Ohio townships, and NFL playoff seasons, and they share boilerplate explanations of their respective systems. The interesting exceptions are the passages that shouldn't repeat — that appear in contexts so different you'd never predict they'd share a sentence.

#1 by score
The Asteroid Preamble
418 articles · 213 words · Score 1,855 · Series tier

The highest-scoring repeated passage is also the most predictable: the standard introductory paragraph for every "Meanings of minor-planet names" article. Wikipedia has 418 such pages, one for each thousand-asteroid range from the low thousands to past 380,000. Every one opens with the same 213-word explanation of how the International Astronomical Union assigns permanent numbers, how discoverers can then propose names, and how Lutz D. Schmadel compiled all the naming citations into his dictionary at the University of Heidelberg's Astronomisches Rechen-Institut until his death in October 2016.

The paragraph is doing necessary work: it contextualizes what these articles are. But its 418-way repetition also points to the scale of solar system cataloguing. The Minor Planet Center has now confirmed more than 600,000 numbered objects. The naming articles cover only a fraction — but already generate 418 near-identical introductions, and the count grows with every new discovery batch.

As minor planet discoveries are confirmed, they are given a permanent number by the IAU's Minor Planet Center (MPC), and the discoverers can then submit names for them, following the IAU's naming conventions. The list below concerns those minor planets in the specified number-range that have received names, and explains the meanings of those names. Official naming citations of newly named small Solar System bodies are approved and published in a bulletin by IAU's Working Group for Small Bodies Nomenclature (WGSBN). Before May 2021, citations were published in MPC's Minor Planet Circulars for many decades. Recent citations can also be found on the JPL Small-Body Database (SBDB). Until his death in 2016, German astronomer Lutz D. Schmadel compiled these citations into the Dictionary of Minor Planet Names (DMP) and regularly updated the collection. Based on Paul Herget's The Names of the Minor Planets, Schmadel also researched the unclear origin of numerous asteroids, most of which had been named prior to World War II.

#12 · most diverse
The Zurich Game
11 articles · 316 words · Score 1,133 · Biographical tier

The most surprising repeated passage is also one of the longest: a 316-word account of the Anglo-American Football Club's 1898–99 season — banned practice ground, missing referee, seven-man start, and the 10–0 rout that announced Switzerland's future champions to anyone who cared to watch. The club, formed by chemistry students at the Federal Polytechnic School, had attended the founding meeting of the Swiss Football Association in 1895 and won its first national title in 1899.

This passage doesn't appear in an article about the club. It appears in the Wikipedia pages for eleven individual players connected to that season — their articles otherwise distinct in career paths, biographical detail, and club affiliation. They share, word for word, this one episode of collective memory: a match played before 10 to 20 spectators that became the founding myth of Swiss football.

What makes this the most interesting entry isn't its score (modest, at 1,133) but its position on the diversity axis — the furthest right point on the chart. Eleven separate biographical subjects sharing a single narrative passage. It's the closest Wikipedia comes to a campfire story.

A curiosity in their 1898–99 season was the friendly game in Zürich on 5 March 1899. The majority of them English students, had formed a club and the members of the Anglo-American Club even attended the founder meeting of the Swiss Football Association (ASF-SFV) in April 1895. They had found a place to play their games, although the Zurich commons was by no means ideal. It was often that the players found the grounds very sludgy or with freshly raised molehills. But at least, it was a homestead that was soon called "Anglo-Platz". Suddenly the announcement: "By decree of the military directorate of the canton of Zurich it is forbidden until further notice to play on the military training area Allmend". The following could be read about the game against FC Basel which was brought forward from the afternoon to the morning: "As a result, the Anglos, who were only partially able to notify their people, started the game with only seven men. The appointed referee was not there because he been scheduled for the afternoon. The crowd consisted of approximately 10 to 20 spectators. Under such circumstances, such an important match should not have been played." Despite all the obstacles: The game became a demonstration of the superiority of the British players from Zurich. The Anglo American Football Club won the match 10–0, with their center forward Robert Collinson alone scoring 8 goals. By then, at the latest, it was clear that the Anglos would be unstoppable on their way to the title.

#25 · widest reach
The Slovak Census Note
2,920 articles · 81 words · Score 933 · Template tier

The passage with the single largest article count is not the most interesting text we found. It is a methodological footnote: an 81-word note explaining why the population figure at the top of a Slovak municipality article may differ from the census figure in the table below. The reason: a student may be officially registered in their home village but spend most of their time studying in the city.

This note appears in 2,920 Wikipedia articles about Slovak villages and municipalities — roughly one for every settlement in the country. The entire Slovak municipal presence on Wikipedia is connected by a single paragraph about the limitations of permanent residence data. It is, in a way, the purest expression of the phenomenon: a sentence of bureaucratic clarification, copied faithfully across thousands of nearly identical stubs, doing invisible consistency work at national scale.

Note on population: The difference between the population numbers above and in the census (here and below) is that the population numbers above are mostly made up of permanent residents, etc.; and the census should indicate the place where people actually mainly live. For example, a student is a citizen of a village because they have permanent residence there (they lived there as a child and has parents), but most of the time he studies at a university in the city.

#2 · high score + high diversity
The Proteasome Shadow
26 articles · 307 words · Score 1,460 · Thematic tier

The second-highest-scoring passage is the most surprising outlier in the chart: a 307-word explanation of the ubiquitin-proteasome system (UPS) — the cellular mechanism that disposes of damaged or misfolded proteins — appearing across 26 different gene articles. PSMA3, PSMB1, PSMA5, PSMD12, PSMB3, PSMC1: each is a different molecular component of the proteasome, with different functions and different disease associations. Their Wikipedia articles are genuinely distinct. But they all share nearly 2,250 characters explaining UPS's role in cancer, neurodegeneration, Alzheimer's, Parkinson's, ALS, Huntington's disease, and inflammatory response.

This is science boilerplate at its most pervasive, and also its most understandable. Every writer who touches a proteasome subunit article reaches for this paragraph because it efficiently establishes why the UPS matters. The problem: a factual update to any of those 307 words now needs to be made 26 times, in 26 separate places, or else the copies quietly diverge until no one is sure which version is current.

Several experimental and clinical studies have indicated that aberrations and deregulations of the UPS contribute to the pathogenesis of several neurodegenerative and myodegenerative disorders, including Alzheimer's disease, Parkinson's disease and Pick's disease, Amyotrophic lateral sclerosis (ALS), Huntington's disease, Creutzfeldt–Jakob disease, and motor neuron diseases, polyglutamine (PolyQ) diseases, Muscular dystrophies and several rare forms of neurodegenerative diseases associated with dementia. As part of the ubiquitin–proteasome system (UPS), the proteasome maintains cardiac protein homeostasis and thus plays a significant role in cardiac ischemic injury, ventricular hypertrophy and heart failure. Additionally, evidence is accumulating that the UPS plays an essential role in malignant transformation...

#23
The Andor Backstory
10 articles · 281 words · Score 972 · Thematic tier

The repetition problem is not confined to academic or governmental content. A 281-word history of how the Disney+ series Andor came to be (from Bob Iger's February 2018 announcement, through Tony Gilroy replacing Stephen Schiff as showrunner, to the COVID-19 delay that forced Gilroy to manage production remotely from New York) appears in ten separate Wikipedia articles, each covering an episode of the show's first season.

Each episode article covers a different story, a different setting, a different moment in the narrative. But all ten share the identical origin paragraph in their "Background" sections. It's a reminder that Wikipedia's serialized episode-page structure naturally generates this pattern: the same contextual paragraph, pasted before each new chapter of any long-running work.

Disney CEO Bob Iger announced in February 2018 that there were several Star Wars series in development, and that November one was revealed as a prequel to the film Rogue One (2016). The series was described as a spy thriller show focused on the character Cassian Andor, with Diego Luna reprising his role from the film. Jared Bush originally developed the series, writing a pilot script and series bible for the project. By the end of November, Stephen Schiff was serving as showrunner and executive producer of the series. Tony Gilroy, who was credited as a co-writer on Rogue One and oversaw extensive reshoots for the film, joined the series by early 2019 when he discussed the first story details with Luna...

The Residue of Knowledge

Think of Wikipedia as a coral reef. From above, it looks like 7.6 million individual organisms — each article its own shape and color, its own story. But underneath, they share a skeleton. The repeated passages we found are that skeleton: paragraphs doing the connective work that keeps the encyclopedia coherent, establishing context every time a new article touches a common system.

Most of that skeleton is dull infrastructure. The same note about Slovak census methodology in 2,920 village pages. The same township governance paragraph in 1,197 Ohio articles. The same explanation of the D'Hondt method in 110 Spanish constituency articles. The same Judiciary Act of 1789 text in 309 Supreme Court case lists. This is how encyclopedias work — and how they must work, to be consistent at scale.

But some of it is more interesting. The hill fort passage, shared across 33 different ancient British earthworks — each a different place, each containing archaeologist Barry Cunliffe's argument about iron-age social stress and shifting trade routes. The superheavy element passage, explaining the island of stability theory, appearing in 25 element articles from Nihonium to Lawrencium. And the 1898–99 football story, living in eleven separate biographical pages — the most diverse set of carrier articles in the entire top 30.

These aren't just boilerplate. They're moments where editors writing about genuinely different things reached for the same explanation, because it was the right one. The same 224 words about why hill forts were built fit every hill fort article because that's the state of the archaeological evidence. The passage belongs to all of them equally, and to none of them exclusively.

"Strip away every repeated passage and what remains is the genuinely unique part of Wikipedia — the residue that no other article touches."

What would Wikipedia look like if you subtracted all duplicated text? You'd lose the minor planet preamble from 418 articles. The census note from 2,920 Slovak pages. The Andor origin story from ten episode articles. The proteasome paragraph from 26 gene pages. You'd be left with the actual unique contribution each article makes — the part written for exactly one context and no other.

That residue — the non-repeated fraction — might be the most accurate measure of how much Wikipedia actually knows. Not 82 billion characters. Something considerably smaller, and considerably more specific.

As for the football match: a group of chemistry students played in Zürich in the winter of 1899, before an audience of perhaps twenty people, under circumstances that argued against the game being played at all. They won anyway. A century and a quarter later, their story appears, word for word, in eleven separate corners of the largest reference work humanity has ever produced. The Allmend may have been frozen. The crowd may have been sparse. But that morning, it turns out, was not going to stay small.

Behind the Analysis: How an AI Agent Worked Through 35 Gigabytes

Finding the most-repeated long passages in 7.6 million Wikipedia articles is not a job for a weekend script. Comparing every article to every other would require 28.8 trillion pairs — at typical computing speeds, roughly 23,000 years of work. This analysis was done differently: by handing the problem to an AI coding agent, watching it work, and redirecting it once.

The tool was Codex, OpenAI's terminal-based coding agent (similar in spirit to Anthropic's Claude Code), running the GPT-5.5 model in fully autonomous "yolo" mode: it could write code, execute it, read the output, and iterate without pausing to ask for permission at any step. The full session ran from 09:15 to 11:35 UTC on 26 May 2026, about two hours and twenty minutes from the first prompt to verified results.

The prompt, given verbatim to the AI agent:
"Download the Wikipedia Parquet dataset from hf://datasets/wikimedia/structured-wikipedia/enwiki/data/*.parquet. Document what it contains in schema.md. Include: how many rows, what are the columns, interesting things should we document about the volume, diversity, quality, etc. Then, identify the longest common substring with the highest frequency. What I mean is, what are the long strings that appear across a large number of articles? 2–3 word phrases are not of interest. 100-word phrases that appear in just two articles are of very mild interest. In other words, we want a combination of long AND frequent as the 'interestingness' metric, so to speak. Document this in lcs.md with the 30 most interesting LCSs, along with where they appear and how many times."

How the analysis unfolded: 5 steps, ~2 hours 20 minutes

Step 1 · ~17 min
Download 35 gigabytes
The agent used huggingface_hub to fetch all 86 Parquet shards of the wikimedia/structured-wikipedia dataset from Hugging Face over 8 parallel connections. It verified the exact byte total (37,163,742,694 bytes, which is 34.61 GiB) against the registry before proceeding.

Step 2 · immediate
Recognise the scale problem
With 7,597,149 articles, a brute-force "compare every pair" approach requires 28.8 trillion comparisons. The agent recognised this immediately and began drafting a profiling script and a repeatable mining script in parallel while the download was still running.

Step 3 · human steer
One redirect mid-session
The user added a single follow-up prompt: "Explore pylcs for speed. Research the fastest approaches, test them out, pick the best." The agent immediately wrote a benchmark script, ran it against real downloaded shards, and committed to the fastest method.

Step 4 · ~90 min
Mine 190 million fingerprints
The winning approach: extract every long paragraph and sentence from every article, compute a fingerprint for each, then count the fingerprints shared by 10 or more articles. The work ran across all 86 shards in parallel, using ProcessPoolExecutor and DuckDB.

Step 5 · <1 min
Rank by interestingness
Each candidate was scored by words × log₂(articles + 1), which rewards text that is both long and widespread rather than extreme on only one dimension. The top 500 were saved; the top 30 were written up in lcs.md with verified article counts.
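
The 28.8 trillion figure follows from counting unordered pairs, n(n − 1)/2. Here is a quick sketch of that arithmetic, using the article count from Step 2 and the throughput figures from the benchmark reported later in this piece. The variable names are illustrative, and the 365.25-day year is an assumption.

```python
ARTICLES = 7_597_149                      # article count reported in Step 2
SECONDS_PER_YEAR = 365.25 * 24 * 3600     # assumed year length

# Unordered pairs of distinct articles: n * (n - 1) / 2
pairs = ARTICLES * (ARTICLES - 1) // 2

def years_needed(pairs_per_second: float) -> float:
    """Years to work through every pair at a constant rate."""
    return pairs / pairs_per_second / SECONDS_PER_YEAR

def minutes_needed(articles_per_second: float) -> float:
    """Minutes for one linear pass over every article."""
    return ARTICLES / articles_per_second / 60

print(f"{pairs:,} pairs")
print(f"pairwise at 39 pairs/s: {years_needed(39):,.0f} years")
print(f"hashing at 3,826 articles/s: {minutes_needed(3826):.0f} minutes")
```

The pair count lands between 28.8 and 28.9 trillion, and the pairwise estimate falls in the 23,000-year range quoted above. A linear pass at the benchmarked hashing rate would take about half an hour.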

Why Not Just Compare Articles Directly?

The most natural tool for finding shared text between two documents is pairwise comparison: take article A, compare it character-by-character against article B, find the longest common run, repeat. A Python library called pylcs does exactly this in compiled C++ code, using dynamic programming — the same technique used in DNA sequence alignment and diff tools. It's fast for two documents. It does not scale to 7.6 million of them.
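
For readers who want to see what that dynamic programming looks like, here is a minimal pure-Python sketch of longest-common-substring between two strings. It illustrates the technique only: it is not pylcs's code, and the function name and sample strings are made up for this example.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest run of characters that appears in both a and b.

    Classic dynamic programming: curr[j] holds the length of the common
    run ending at a[i-1] and b[j-1]. Time is O(len(a) * len(b)).
    """
    best_len = 0
    best_end = 0                      # end index (exclusive) of the best run in a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


article_a = "The club won the match 10-0 in Zurich."
article_b = "In 1899 the club won the match 10-0 away."
print(repr(longest_common_substring(article_a, article_b)))
```

The nested loop is the catch: the cost of a single comparison grows with the product of the two text lengths, and that is before multiplying by the number of article pairs.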

The agent benchmarked four approaches on a sample of 2,000 real downloaded articles. The measured throughput, extrapolated to the full corpus:

The Speed Race: analyzing all 7.6 million Wikipedia articles
pylcs, pairwise · O(n²) · 39 pairs / sec · needs 28.8 trillion comparisons · about 23,400 years
pylcs, one→many · O(n²) · 22 comparisons / sec · needs 28.8 trillion comparisons · about 41,500 years
Token 12-shingles · O(n) · 1,284 articles / sec · needs 7.6 million articles
Exact fragment hashing ✓ · O(n) · 3,826 articles / sec · needs 7.6 million articles

O(n²) methods must compare every article to every other: 28.8 trillion pairs for 7.6M articles.
O(n) methods scan each article once: 7.6 million total operations.
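
The piece does not spell out the "token 12-shingles" method. Assuming it means the usual shingling technique, in which every window of 12 consecutive tokens is hashed and articles are matched on shared window hashes, a minimal sketch might look like this. Whitespace tokenization, the helper name, and the sample strings are assumptions of this example.

```python
import hashlib

def shingle_hashes(text: str, k: int = 12) -> set[int]:
    """Hash every window of k consecutive whitespace-separated tokens."""
    tokens = text.split()
    hashes = set()
    for start in range(len(tokens) - k + 1):
        window = " ".join(tokens[start:start + k])
        digest = hashlib.blake2b(window.encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "little"))
    return hashes


# A 15-token run shared by two otherwise different (invented) pages
shared = ("as minor planet discoveries are confirmed they are given "
          "a permanent number by the center")
page_a = "Intro for range one . " + shared + " and then names follow"
page_b = shared + " after which a different sentence begins"

common = shingle_hashes(page_a) & shingle_hashes(page_b)
print(len(common))   # number of 12-token windows the two pages share
```

Unlike whole-fragment hashing, shingles can catch a shared run even when the surrounding sentence differs. The price is many more hashes per article, which may be part of why the method benchmarks slower.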

What Is Fingerprinting?

A fingerprint (or hash) is a short, fixed-length number computed from a piece of text. The rule is simple: identical inputs always produce identical outputs, and any change, even a single character, produces a completely different output (barring a vanishingly rare collision). It's like a library catalogue number that is computed deterministically from the book's contents.

"Hill forts developed in the Late Bronze Age…"
a3f8c2d1b0e74591
"Hill forts developed in the Late Bronze Age…" (identical copy in another article)
a3f8c2d1b0e74591 ✓ same
"Hill forts developed in the Early Bronze Age…" (one word changed)
7b2e9f4c1a830d56 ✗ different
This is called the avalanche effect: a tiny change cascades into a completely unrecognisable output.

The algorithm used is BLAKE2b, a cryptographic hash function designed for speed. The same algorithm is used to verify file downloads and sign software packages. Here's the actual code:

fingerprinting — Python
import hashlib

def fingerprint(text: str) -> int:
    # BLAKE2b with 8-byte output = a 64-bit integer fingerprint
    digest = hashlib.blake2b(text.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")

# Two identical sentences in different articles → same fingerprint
text_a = "Hill forts developed in the Late Bronze Age..."
text_b = "Hill forts developed in the Late Bronze Age..."  # different article

fingerprint(text_a) == fingerprint(text_b)   # True  ← match found!

# One word different → completely different fingerprint
text_c = "Hill forts developed in the Early Bronze Age..."

fingerprint(text_a) == fingerprint(text_c)   # False

The key advantage: instead of comparing 7.6 million articles pairwise (28.8 trillion operations), you compute one fingerprint per fragment per article, roughly 190 million fingerprints in total, and then let a database count which fingerprints appear in multiple articles. Finding duplicates in a list of 190 million numbers takes seconds. Finding shared substrings across 28.8 trillion article pairs takes millennia.
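
One thing worth checking with 8-byte fingerprints is the chance that two different fragments collide by accident. Below is a back-of-the-envelope birthday estimate. It assumes the hash behaves like a uniform random 64-bit function and treats all of the roughly 190 million fragments as distinct texts, which makes it an upper bound.

```python
FRAGMENTS = 190_000_000      # approximate fragment count quoted in the article
HASH_BITS = 64               # digest_size=8 bytes

# Number of fragment pairs that could, in principle, collide
pairs = FRAGMENTS * (FRAGMENTS - 1) // 2

# Each pair of different texts collides with probability 2**-64,
# so the expected number of accidental collisions is:
expected_collisions = pairs / 2**HASH_BITS

print(f"expected accidental collisions: {expected_collisions:.4f}")
```

Under these assumptions the expected number of accidental collisions across the whole corpus is on the order of one in a thousand, so a 64-bit digest looks adequate at this scale.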

The fragment extraction uses regular expressions to split articles on sentence boundaries, keeping only long fragments (at least 70 characters, at least 10 words) that are worth tracking:

fragment extraction — Python
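import hashlib
import re

def fingerprint(text: str) -> int:
    # Same 64-bit BLAKE2b fingerprint as in the previous listing
    digest = hashlib.blake2b(text.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")

# Split on whitespace that follows sentence-ending punctuation
# and precedes a capital letter or digit
SENTENCE_RE = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9])')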

def extract_fragments(article_text: str):
    """Yield (fingerprint, sentence) for every long sentence in an article."""
    for sentence in SENTENCE_RE.split(article_text):
        sentence = sentence.strip()
        if len(sentence) < 70:          # too short to be interesting
            continue
        if len(sentence.split()) < 10:  # fewer than 10 words
            continue
        yield fingerprint(sentence), sentence

# For each of 7,597,149 articles across 86 Parquet shards,
# this produces ~25 fragments on average → ~190 million total.
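
A small usage sketch shows how the filter behaves on a made-up three-sentence text. The regex and the fingerprint function are repeated from the listings above so the snippet runs on its own, and the length checks are folded into one condition. The sample strings are invented.

```python
import hashlib
import re

SENTENCE_RE = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9])')

def fingerprint(text: str) -> int:
    digest = hashlib.blake2b(text.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")

def extract_fragments(article_text: str):
    """Yield (fingerprint, sentence) for sentences of 70+ chars and 10+ words."""
    for sentence in SENTENCE_RE.split(article_text):
        sentence = sentence.strip()
        if len(sentence) >= 70 and len(sentence.split()) >= 10:
            yield fingerprint(sentence), sentence

sample = (
    "Short one. "
    "This sentence is comfortably longer than seventy characters and has "
    "more than ten words in it. "
    "Tiny."
)
fragments = list(extract_fragments(sample))
print(len(fragments), fragments[0][1][:30])

# Caveat: the regex also splits after initials, because "D." is followed
# by whitespace and a capital letter.
pieces = SENTENCE_RE.split("Compiled by Lutz D. Schmadel in Heidelberg.")
print(pieces)
```

Splitting after initials is harmless for exact-duplicate detection as long as every copy is split the same way, but it does mean a "fragment" is not always a full sentence.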

How DuckDB Finds the Matches

Once all 190 million fingerprints are written to intermediate Parquet files (one file per shard, 6 workers in parallel), DuckDB takes over. DuckDB is an in-process analytical database — think of it as a spreadsheet engine that can read billions of rows from files without loading them all into memory at once. A single SQL query aggregates all 190 million fingerprint records and surfaces only those that appear in ten or more distinct articles:

aggregation — SQL (DuckDB)
-- Read all 190 million fingerprint records from the fragment files
SELECT
    hash,
    COUNT(DISTINCT article_id)          AS articles,   -- distinct articles containing it
    SUM(occurrences)                    AS total_hits,
    MAX(word_count)                     AS words,
    MAX(word_count) * LOG2(COUNT(DISTINCT article_id) + 1) AS score  -- words × log₂(articles + 1)
FROM read_parquet('analysis/lcs_fragments/*.parquet')
GROUP BY hash
HAVING COUNT(DISTINCT article_id) >= 10  -- appears in at least 10 articles
ORDER BY score DESC
LIMIT 500;                               -- keep top 500 candidates

This runs in seconds. DuckDB reads only the columns it needs (hash, article_id, occurrences, word_count) and aggregates on the fly, never loading all 190 million rows into memory simultaneously. A second pass over the original Parquet shards then retrieves the actual text and example article titles for the top 500 hashes, since the intermediate files store only fingerprints, not the original sentences.
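
To make the aggregation concrete, here is the same group, count, filter, and rank logic in plain Python on a toy table. The rows and the lowered threshold of 2 articles are invented for illustration; the column names follow the query.

```python
import math
from collections import defaultdict

# (hash, article_id, word_count) rows, as the fragment files might hold them
rows = [
    (0xA1, 1, 213), (0xA1, 2, 213), (0xA1, 3, 213),
    (0xB2, 1, 81),  (0xB2, 1, 81),            # same article twice: counts once
    (0xC3, 4, 316), (0xC3, 5, 316),
]
MIN_ARTICLES = 2   # the article's query uses 10

articles_by_hash = defaultdict(set)
words_by_hash = {}
for frag_hash, article_id, word_count in rows:
    articles_by_hash[frag_hash].add(article_id)        # COUNT(DISTINCT article_id)
    words_by_hash[frag_hash] = max(words_by_hash.get(frag_hash, 0), word_count)

ranked = sorted(
    (
        (words_by_hash[h] * math.log2(len(ids) + 1), h, len(ids))
        for h, ids in articles_by_hash.items()
        if len(ids) >= MIN_ARTICLES                     # HAVING ... >= threshold
    ),
    reverse=True,                                       # ORDER BY score DESC
)
for score, frag_hash, n_articles in ranked:
    print(f"{frag_hash:#x}  articles={n_articles}  score={score:.1f}")
```

The set per hash is what makes the count "distinct": a fragment repeated twice inside one article still counts as one article, so the 0xB2 row is filtered out.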

The two-pass design is deliberate. Pass 1 is CPU-intensive (hashing) and easily parallelised across 6 workers; Pass 2 is I/O-intensive (reading large Parquet files) and benefits from DuckDB's columnar reader. Together they process 34.6 gigabytes in about 90 minutes, or roughly 6 to 7 megabytes per second of compressed input. That is far below what a modern SSD can stream, which suggests the run is limited by decoding and hashing rather than by the disk.

What Makes a Passage "Interesting"?

The challenge with ranking repeated passages is that raw frequency and raw length both fail on their own. A 2-word phrase like "see also" appears in nearly every Wikipedia article — but it's useless. A 500-word passage appearing in exactly two articles may be impressive but hard to call a pattern. The score needs to reward both dimensions simultaneously.

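The formula chosen was score = words × log₂(articles + 1). The logarithm dampens the article count: every tenfold increase in reach adds roughly the same amount, a little over 3 points per word, so going from 1,000 to 10,000 articles counts for about as much as going from 10 to 100, even though it involves 100 times as many additional articles. The word-count multiplier, by contrast, is linear, so a passage twice as long scores twice as high at the same reach.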

scoring — Python
import math

def score(words: int, articles: int) -> float:
    """Higher is more interesting. Rewards length AND frequency."""
    return words * math.log2(articles + 1)

# Rank #1 — The asteroid preamble (213 words, 418 articles):
score(213, 418)    # → 1855.4   ← top of the list

# A 100-word passage in 100 articles:
score(100, 100)    # → 665.8    ← much less interesting

# A 2-word phrase in 1 million articles:
score(2, 1_000_000)  # → 39.9  ← nearly zero — too short to matter

# A 500-word passage in only 2 articles:
score(500, 2)        # → 792.5  ← decent score but not top-30 without more reach
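
As a sanity check, the same formula can be applied to the five passages profiled earlier. The word and article counts are taken from the story cards. The one-point tolerance is an assumption that allows for rounding in the published scores.

```python
import math

def score(words: int, articles: int) -> float:
    return words * math.log2(articles + 1)

# (name, words, articles, published score) from the story cards
published = [
    ("Asteroid preamble", 213, 418, 1855),
    ("Proteasome shadow", 307, 26, 1460),
    ("Zurich game", 316, 11, 1133),
    ("Andor backstory", 281, 10, 972),
    ("Slovak census note", 81, 2920, 933),
]

TOLERANCE = 1.0   # assumed allowance for rounding
for name, words, articles, expected in published:
    computed = score(words, articles)
    ok = abs(computed - expected) < TOLERANCE
    print(f"{name:<20} computed={computed:7.1f}  published={expected}  match={ok}")
```

All five published scores are reproduced to within one point.
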
"The AI didn't just run the analysis — it chose the right method. The benchmark detour was the agent's own interpretation of what 'done' means: verify the approach before committing to an hours-long run."

What AI Coding Agents Make Possible

Four Python scripts were written, linted, and executed over the course of this session: one to download the dataset, one to profile its schema, one to benchmark the three candidate algorithms, and one to run the full two-pass mining pipeline. The agent also used uv to manage Python environments, pyarrow to read Parquet files, pylcs for the benchmark comparison, tqdm for progress bars, and DuckDB for SQL aggregation. All autonomously installed, called, and verified.

A skilled data engineer could produce the same pipeline in a day or two of focused work. The AI agent produced it in 140 minutes — including the benchmarking detour that a human might skip to save time. One person with one tool and one afternoon.

The Wikipedia corpus is not unusual in its scale. Government records, court filings, scientific literature, and financial disclosures all exist at similar sizes. What this session demonstrated is that the barrier between "I wonder if this pattern exists in this data" and "here are the top 30 verified examples with article counts and diversity scores" has dropped to a single conversation. The ideas, the judgment, the interpretation: those still belong to the human. The four scripts, the three benchmarks, the six parallel workers, and the 190 million fingerprints: those belong to the machine.

The 213-word asteroid preamble — appearing word for word across 418 Wikipedia articles — was always there. It took a two-hour autonomous sweep through 190 million text fragments to find it.