A rigorous analysis of which domains hold Wikipedia together, and which ones, if they vanished tomorrow, would leave thousands of articles with nothing to stand on.
Imagine you run a small statistics bureau in Warsaw. Your website has about forty thousand pages of census data: population figures, demographic breakdowns, the usual. It's not glamorous. You don't have a Twitter account. Nobody sends you fan mail. But one Wednesday morning, you decide to take the server down for maintenance, and accidentally leave it down.
In that moment, without fanfare or warning, roughly 45,000 Wikipedia articles lose their only source of evidence.
Not a source. Not their best source. Their only source.
This is the story of how Wikipedia's knowledge rests on a surprisingly thin and uneven foundation, and how the domains that matter most aren't the ones you'd guess. It's a story about single points of failure hiding in plain sight, and about the difference between a source that's everywhere and a source that's irreplaceable.
The obvious assumption: big websites hold up Wikipedia. The New York Times, the BBC, archive.org, Google. They're cited constantly, so surely if one of them disappeared, Wikipedia would collapse.
Let's test that assumption with actual numbers.
archive.org is cited in nearly three million Wikipedia articles, making it the single biggest domain on the entire platform: bigger than Google, bigger than the New York Times, bigger than the BBC.[1] But if archive.org evaporated tomorrow, only about one in ten of those articles would be left completely without citations. The other 90% have backup sources. Wikipedia's heavy hitters are, it turns out, well-distributed.
This isn't an accident. The Guardian publishes thousands of stories. The New York Times covers every angle. BBC News is everywhere. These sources appear on so many different topics that no single article is likely to depend on them exclusively. They're diverse. They're resilient.
But then there's stat.gov.pl.
The Central Statistical Office of Poland runs a website, stat.gov.pl, that publishes population and demographic data for Polish municipalities. It's a functional, boring, government statistics portal. It has no Wikipedia article of its own. Nobody is writing think-pieces about it.
It is also cited on 54,465 Wikipedia articles, and if it disappeared, 45,196 of those articles (83%) would be left with zero citations.
Why? Because there are roughly 50,000 Wikipedia articles about Polish towns, villages, and administrative divisions. Articles like "Gmina Wielka Wieś" or "Kolonia Sobiesęki." Each of these articles exists primarily to report population data. And the source for that population data, in 83% of cases, is exactly one website: stat.gov.pl.
The Polish village of Ruda Maleniecka has 1,247 inhabitants. We know this because Wikipedia says so. Wikipedia says so because stat.gov.pl says so. If stat.gov.pl goes away (a restructuring, a domain change, a budget cut, a minister who doesn't see why the old URLs need to stay), then Ruda Maleniecka's Wikipedia article has nothing to stand on. It becomes a claim without evidence. A fact without a source.
This is the structure of fragile knowledge: not one fragile source, but a whole category of articles all depending on a single domain that nobody notices until it's gone. It is a single point of failure for an entire slice of human knowledge.
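To make the arithmetic concrete, here is a minimal sketch of how a dependence ratio like the 83% above could be computed. It assumes the citation data is available as a mapping from article title to the set of domains that article cites; that data shape, the function name `dependence_stats`, and the toy citations are illustrative assumptions, not the pipeline behind this analysis.

```python
from collections import Counter


def dependence_stats(page_domains):
    """Return {domain: (total_pages, exclusive_pages, ratio)}.

    page_domains maps an article title to the set of domains it cites.
    A page counts as exclusively dependent on a domain when that domain
    is the only one it cites.
    """
    total = Counter()
    exclusive = Counter()
    for domains in page_domains.values():
        for domain in domains:
            total[domain] += 1
        if len(domains) == 1:
            exclusive[next(iter(domains))] += 1
    return {d: (total[d], exclusive[d], exclusive[d] / total[d]) for d in total}


# Toy data: hypothetical citation sets, for illustration only.
pages = {
    "Ruda Maleniecka": {"stat.gov.pl"},
    "Gmina Wielka Wieś": {"stat.gov.pl"},
    "Warsaw": {"stat.gov.pl", "bbc.co.uk"},
}
stats = dependence_stats(pages)
print(stats["stat.gov.pl"][:2])  # (3, 2): three citing pages, two exclusive

# The figures quoted above for stat.gov.pl:
print(round(45_196 / 54_465 * 100))  # 83
```

A domain can be cited on many pages and still score low here, as long as those pages also cite something else.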
Figure: The Citation Landscape: 500 Domains, Ranked by Reach and Dependence
Each rectangle is a domain. Size shows the total number of Wikipedia pages that cite it. Color shows the share of those pages that would lose all citations if the domain vanished (the dependence ratio), scaled from the lowest to the highest value at each threshold. In the interactive version, a slider changes the threshold and domains can be grouped by category to reveal sector patterns.
Drag the threshold slider past 30% and watch what happens. A new pattern emerges: an entire sector lights up. Biodiversity databases.
marinespecies.org (the World Register of Marine Species) is cited on 42,000 Wikipedia articles. Nearly 40% of those articles cite it as their only source. biolib.cz, a Czech biodiversity library, is cited on 24,000 articles, and 68% of them rely on it exclusively. There's afromoths.net, which catalogs African moths, and funet.fi, a Finnish network that hosts butterfly and moth taxonomy data.[2] These are small, specialized databases maintained by researchers who love moths and fish and plants and beetles, and they have quietly become the sole evidentiary foundation for tens of thousands of Wikipedia articles.
The pattern makes sense, if you think about it. When Wikipedia editors write articles about species (Heliothis subflexa, say, or the Cape fur seal), they go to the canonical database for that species group. Not because they're lazy, but because that's where the authoritative information lives. There is one World Register of Marine Species. There is not a competing register, a backup register, a register-with-redundancy. There is one database, and if it goes offline, the Wikipedia article for every marine species that database covers is suddenly unverified.
This is not a criticism of these databases. They are doing essential, difficult, underfunded work. The IUCN Red List, the Catalogue of Life, GBIF: these are the authoritative records of life on Earth. But their existence as singular authorities makes every Wikipedia article that cites them structurally dependent in a way that no news site, however widely read, ever creates.
The sports sector has its own version of this issue, but with a different flavor.
sports-reference.com is cited on 104,000 Wikipedia articles, and 37,000 of those pages (36%) cite it as their only external source. These are articles about athletes and games and records, and sports-reference.com has built the canonical statistical database for American sports. It's extremely good at what it does. It also means that 37,000 Wikipedia articles about sports history are a single domain shutdown away from having nothing to stand on.
The same logic applies to espncricinfo.com (26% dependence for 31,000 articles) and olympedia.org (19% for 32,000 articles). Sports stats live in purpose-built databases, and those databases are not replicated.
Compare this to The Guardian, cited on 203,000 articles, where only 0.6% of those articles rely on it exclusively. The Guardian covers general news. Sports databases cover a specific domain with unique data. The difference isn't just scale; it's the structure of the knowledge itself.
We should return to archive.org, because it's instructive in exactly the opposite way.
archive.org is cited on 2.89 million Wikipedia articles. That's roughly 38% of all English Wikipedia articles. It's an astonishing presence. And yet, if archive.org disappeared, only 9.8% of those articles would lose all their citations. The reason is structural: archive.org links are often used as archived backups of other URLs. An article cites a newspaper story, and then also cites an archive.org snapshot of that same story, as insurance against link rot. If archive.org vanishes, the original newspaper link is still there. The article still has a source.
This is brilliant and unintentional design. archive.org became Wikipedia's insurance policy against the rest of the internet dying, and the side effect is that archive.org itself is not a single point of failure. It's a redundancy layer on top of redundancy.
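The redundancy argument can be illustrated with a small sketch: remove every citation that points at a given domain and see what is left. The helper names and the example URLs below are hypothetical, and matching on the URL's hostname is an assumption about how a citation would be attributed to a domain.

```python
from urllib.parse import urlparse


def cites_domain(url, domain):
    """True if the URL's host is `domain` or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def surviving_citations(urls, removed_domain):
    """Citations that remain if every link to removed_domain stops working."""
    return [u for u in urls if not cites_domain(u, removed_domain)]


# Hypothetical article: a news story plus an archived snapshot of it.
backed_up = [
    "https://news.example.com/story",
    "https://web.archive.org/web/2020/https://news.example.com/story",
]
# Hypothetical article whose only citation is an archive.org item.
archive_only = ["https://archive.org/details/some-scanned-book"]

print(len(surviving_citations(backed_up, "archive.org")))     # 1
print(len(surviving_citations(archive_only, "archive.org")))  # 0
```

The first article keeps its original link, so it would not count toward archive.org's dependence ratio; only pages like the second one do.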
Google.com is the second-most-cited domain on Wikipedia after archive.org, appearing on 830,000 pages. These are mostly Google Books links and Google Scholar citations. If Google disappeared, 45,000 articles (5.4% of those citing Google) would have zero remaining citations. That's more than 40,000 articles citing only Google for their evidence.
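Lining up the figures quoted so far makes the contrast between reach and dependence easier to see. The sketch below uses only numbers stated in this article, reads each "articles" count as total citing pages (an assumption where the text is ambiguous), and leaves out marinespecies.org because its ratio is given only as "nearly 40%". The function name and the 10,000-page cutoff parameter are illustrative.

```python
# (label, total citing pages, dependence ratio in percent), as quoted in the text.
FIGURES = [
    ("archive.org", 2_890_000, 9.8),
    ("google.com", 830_000, 5.4),
    ("The Guardian", 203_000, 0.6),
    ("sports-reference.com", 104_000, 36.0),
    ("stat.gov.pl", 54_465, 83.0),
    ("olympedia.org", 32_000, 19.0),
    ("espncricinfo.com", 31_000, 26.0),
    ("biolib.cz", 24_000, 68.0),
]


def rank_by_dependence(rows, min_pages=10_000):
    """Keep domains with at least min_pages citing pages, most dependent first."""
    kept = [row for row in rows if row[1] >= min_pages]
    return sorted(kept, key=lambda row: row[2], reverse=True)


ranked = rank_by_dependence(FIGURES)
for label, pages, pct in ranked:
    print(f"{label:<22}{pages:>10,}{pct:>7.1f}%")
```

Sorted this way, the largest domains by reach fall to the bottom of the list and the specialist databases rise to the top.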
The web is full of link rot. Studies suggest that roughly half of all URLs cited in academic papers eventually stop working. For Wikipedia, which uses web citations heavily, this is an ongoing maintenance problem, and archive.org exists precisely to address it. But the problem revealed by this analysis is different: it's not about broken links. It's about structural dependence.
There are specific categories of knowledge (Polish municipality demographics, marine species taxonomy, moth classifications, sports statistics) where Wikipedia has systematically centralized its evidence base on a small number of specialist databases. Those databases are often small, lightly funded, run by researchers or government agencies, and not set up with continuity guarantees. They're not archive.org. They don't have millions of dollars and a nonprofit mandate. They're a professor's side project, or a government ministry's data portal, or a specialist society's online catalog.
If any of those domains vanishes (or restructures its URLs, or moves behind a paywall, or just forgets to renew its TLS certificate), a corresponding segment of Wikipedia becomes unverifiable overnight.
The most resilient knowledge lives in overlapping, redundant sources: news stories covered from ten angles, scientific findings documented in multiple papers, historical events recorded in archives and books and court documents. The most fragile knowledge is the knowledge that exists, authoritatively, in exactly one place: the only census of a small country's villages, the only register of a taxonomic group, the only database of a sport's career statistics.
That's not bad design. That's just the structure of reality. Some knowledge is only in one place. The question is whether Wikipedia, or anyone else, has thought seriously about what happens when that one place goes away.[4]
Domains with ≥ 10,000 citing pages, sorted by the share of pages that rely on each domain exclusively.
| Domain | Source category | Taxonomy | Total citing pages | Exclusively dependent | Dependence ratio |
|---|---|---|---|---|---|