In July 2025, in the heavy morning warmth of Thiruvananthapuram, the opening talk at the TeX User Group annual conference began with a crayon drawing.
Not a polished diagram. Not a carefully curated slide. A hand-drawn family tree on crumpled paper, made by a six-year-old girl named Emily, who had a problem to solve. She had told her teacher she had seven grandmothers. The teacher declared it impossible. Emily went home and built a proof.
Rob Schrauwen had been with Elsevier since 1991. He was trained as a mathematician. He'd spent three decades as Chief Content Architect, then VP of Data and Platform Strategy, overseeing the infrastructure that processes and connects tens of millions of scientific articles. When he stood up to open TUG 2025, he pulled out a photograph of a child's drawing and said: this is what I want to talk about.
"She produced this — a drawing that my granddaughter made when she was six. We can learn a lot from it."
— Rob Schrauwen
The drawing was, Rob explained, a masterpiece of knowledge graph design. Emily had established a clear policy: great-grandmothers count as grandmothers. She'd assigned each grandmother a unique circled number — a unique "Oma ID." She'd included men in the picture purely as structural evidence, because a person can only be a grandmother if they are the mother of someone. And she'd focused ruthlessly on the task: only grandmothers are numbered. Nothing else.
Then came the detail that stopped the room. When Emily wasn't certain whether "Hetty" was spelled with a "y" or with "ie," she simply wrote "Oma" — reverting to the generic label rather than risk introducing a spelling error that would undermine her entire case.
"She basically invented continuous data quality or defensive programming. If you're not sure about the facts, you revert to the generic fact. Because if she would have misspelled Hetty, the teacher would have said, 'Oh, this whole story is nonsense because you can't even spell the name of your grandmother.'"
— Rob Schrauwen
The punchline? Emily actually had eight grandmothers. One — her own — was missing from the tree, even though Rob himself appears in the drawing, standing right next to "Oma 1." The recall was seven-eighths. Imperfect.
And yet nobody doubted the data. The trust was complete. Trust built not through infallibility, but through transparency of method, clarity of policy, and total focus on the problem at hand.
Rob had spent thirty years trying to build exactly this at one of the world's largest scientific publishers. Here it was, in purple crayon, by a six-year-old.
The Machine Behind One Hundred Million Articles
To understand what Rob was trying to say, you need to understand what Elsevier's data platform actually does. Because the scale of it is staggering even by the standards of big tech.
Elsevier's knowledge graph contains 23.8 million full-text articles from ScienceDirect, 102.3 million records from Scopus, 40 million researcher profiles, and 100,000 carefully curated organizations — all linked together. Not just knowledge about research, but knowledge in research: the ideas, methods, findings, and connections that live inside the papers themselves.
How Elsevier's data platform works — five steps, a hundred million articles, all probabilistic
The pipeline that generates all this runs in five steps: harvest evidence about entities, extract occurrences of those entities, match and cluster them, form a context view for each product, and make the datasets available. The critical word Rob used again and again: probabilistic. The machine doesn't know for certain. It estimates, with provenance attached.
That insistence on provenance — on recording not just what you know but how you know it — is what makes the system trustworthy. And it's what leads directly to Rob's most fundamental argument.
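To make that concrete, here is a minimal Python sketch of the five steps. Everything in it is illustrative (the names, the toy record, and the confidence values are invented, not Elsevier's actual code); the point is that each record carries a source and a confidence score rather than a bare fact.

from dataclasses import dataclass

@dataclass
class Assertion:
    """One probabilistic claim, with its provenance attached."""
    claim: dict
    source: str
    confidence: float

# 1. Harvest: collect raw evidence about entities.
evidence = [{"text": "M. Curie", "article": "doi:10.1234/example", "field": "author"}]

# 2. Extract: turn the evidence into entity occurrences.
occurrences = [Assertion(claim={"surface": e["text"]},
                         source=e["article"], confidence=1.0)
               for e in evidence]

# 3. Match and cluster: link occurrences to candidate entities,
#    probabilistically, never with certainty.
matched = [Assertion(claim={"surface": o.claim["surface"],
                            "person": "author:marie-curie"},
                     source="matcher-v3", confidence=0.92)
           for o in occurrences]

# 4. Context view and 5. publish: each product selects the view it
#    needs, for example only claims above a confidence threshold.
dataset = [a for a in matched if a.confidence > 0.8]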
"It is extremely important that we stay totally true to the source. If some page number says 38 in the article, and it is 37 in PubMed, and maybe we replace it by 37 — I think with that, we have already ruined the entire knowledge graph."
— Rob Schrauwen
Vendors had been doing this "helpfully" for years: silently correcting data that seemed wrong. Rob's position: that's a catastrophe. You don't know that PubMed is right. You don't know that the difference isn't meaningful. Stay true to the source. If you must annotate, do it outside the document and mark it clearly as your assertion, not the author's.
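One way to picture "annotate outside the document": a stand-off note that records PubMed's page number as our assertion, clearly attributed, while the article's own text stays untouched. The sketch below borrows the W3C Web Annotation vocabulary; the agent and target identifiers are invented.

import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "assessing",
    "creator": "https://example.org/agents/metadata-team",  # whose claim this is
    "target": "https://example.org/articles/123#ref-12",    # the untouched reference
    "body": {
        "type": "TextualBody",
        "value": "PubMed records pageStart as 37; the article text says 38."
    }
}
print(json.dumps(annotation, indent=2))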
From Tagging to Real Tagging
Here is where Rob landed his central insight — and where the talk earned its title.
For decades, scientific publishing has been built on markup: wrapping text in XML tags. <author>, <affiliation>, <reference>. The vocabulary evolved, the DTDs grew elaborate, and enormous effort went into encoding every element of every article into a structured hierarchy. The assumption was: get the structure right, and everything else follows.
Rob's counter: markup is the wrong tool entirely. The right tool is annotation.
"A tag that you stick onto it like a sticky note is far more like a tag than an XML tag. So I say we go from tagging to real tagging — and put a sticky note onto it."
— Rob Schrauwen
Markup, Rob explained, fails in four fundamental ways: it's unintuitive to mark up things that aren't there; it works best only when everything is in a single file; it's impossible to have overlapping markup; and you must change the content itself to fit the schema. Stand-off annotation — placing assertions outside the document — solves all four.
The Key Insight
When annotations live outside the document, multiple people can make simultaneous claims about the same text. You can record what's absent. You can make overlapping assertions. You can capture multiple conflicting opinions and weight them later. And you never touch the original.
Inline XML (what Rob called "markup")
<article>
  <author>Pierre and Marie Curie</author>
  <affiliation>Departments of Medicine
    and Cardiology in India</affiliation>
</article>
One author. One affiliation. Reality ignored in favour of schema.
Stand-off annotation (what Rob proposed)
{
  "@context": "https://schema.org",
  "@graph": [{
    "@id": "#au1",
    "@type": "Person",
    "name": "Pierre and Marie Curie",
    "alternateName": ["Pierre Curie", "Marie Curie"]
  }, {
    "@id": "#aff1",
    "@type": "Organization",
    "name": "Departments of Medicine and Cardiology",
    "location": "India",
    "department": [
      {"@type": "Organization", "name": "Department of Medicine"},
      {"@type": "Organization", "name": "Department of Cardiology"}
    ]
  }]
}
Both people recorded. Both departments. Provenance attached. Original text untouched.
Rob had been moving Elsevier toward this model for years, using RDF and what he called the "golden rules for data": always add provenance; respect for zero (mark up what's absent, not just what's present); anchor everything in time; use a common vocabulary; and allow multiple opinions about the same thing.
That last rule is perhaps the most important. In a markup world, there is one slot for "author" and one answer goes there. In an annotation world, you can record that the author said X, PubMed says Y, and an algorithm concluded Z — all simultaneously, with their weights and sources preserved. The corpus decides which to trust.
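A small sketch of what that last rule buys, with invented sources and weights: three simultaneous opinions about the same page number, none of them overwriting the others, plus one possible downstream policy for choosing among them.

from collections import defaultdict

# Conflicting opinions coexist, each with its source and weight.
opinions = [
    {"field": "pageStart", "value": "38", "source": "article",      "weight": 1.00},
    {"field": "pageStart", "value": "37", "source": "PubMed",       "weight": 0.85},
    {"field": "pageStart", "value": "37", "source": "algorithm-v2", "weight": 0.60},
]

# One possible policy: trust the value with the greatest combined
# weight. The record keeps every opinion either way.
totals = defaultdict(float)
for o in opinions:
    totals[o["value"]] += o["weight"]
best = max(totals, key=totals.get)  # "37" under these invented weights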
Thirty Years of Things That Were True but Irrelevant
The title of the talk wasn't merely clever. It was the organizing principle of Rob's entire career narrative — and it's more useful than it first appears.
Being "true but irrelevant" is a specific kind of failure. It's not wrong. It's not stupid. The ideas Rob described were often genuinely insightful, technically correct, and deeply considered by very smart people. They just turned out not to matter in the way anyone expected. The future that arrived was different from the future that was designed for.
The Promises of the 1990s — Did They Materialize?
Three grand ideas that were entirely correct in principle and almost entirely unused in practice.
1
Separate content from presentation
The plan: separate data from style so the same XML file could appear in a thousand forms — different journals, different layouts, different media. The reality: articles never appeared in multiple styles. "Style" and "data" became hopelessly confused. Table column widths (left entirely to typesetters) turned out to matter enormously and were never encoded. Critical metadata every user needed was never consistently present.
True, but irrelevant
2
Re-publish content everywhere
The plan: the same article file would flow into journals, books, and databases interchangeably. The reality: it never happened. Every re-appearance got its own DOI, its own citation details. A file couldn't carry the metadata one appearance required without becoming unusable for every other. "That makes these files completely useless," Rob said flatly.
True, but irrelevant
3
XML-first everything
The plan: one canonical XML file generates both the online rendering and the PDF. The reality: this required extraordinary standardization. Every journal wanted to be special. DTD changes became impossibly costly — "with the tiniest change to our DTDs, it's extremely complicated." LaTeX was quietly introduced as an exception in 2007. The XML-first ideal never became XML-only practice.
True, but irrelevant
Slide 6 — Three grand promises; three polite disappointments
The Emphasis Nine Problem
One story Rob told is quietly devastating. In the 1980s, Elsevier's predecessors built a DTD with no <bold> or <italic>, only <emphasis1> through <emphasis9>. The vision was semantic purity: if you know that E. coli is a species name, mark it as <species>, not merely italic. The system could then do intelligent things with that semantic information.
The vision was philosophically correct. The execution was practically irrelevant. No author would ever accept that their bold and italic might "swap around." The semantic markup was never used. And Rob's conclusion:
"To add an interpretation can only go wrong. It's far better if we just use italic and annotate on the side if we want to."
— Rob Schrauwen
Eating Your Own Dog Food — Or Rather, Not
Here Rob delivered the confession that got the biggest laugh of the morning. As Chief Content Architect of Elsevier, he was responsible for the Elsevier DTD. He even wrote two books about it — "tag by tag" — 444 pages each, documenting every element of the standard he'd designed.
He wrote those books in LaTeX.
Slide 7 — The Chief Content Architect wrote his documentation in LaTeX, not Elsevier XML
"I was Chief Content Architect of Elsevier, telling everybody you have to be very, very precise on the tagging. But I have never in all those 30 years coded a third-order heading in Word as a third-order heading. I just made it italic and blue, over and over again. And this worked."
— Rob Schrauwen
He published an internal newsletter for twenty-nine years. Hundreds of employees subscribed. He formatted it manually — italic, blue — every single time. No semantic markup. No proper heading structure. And not once did it interfere with anyone's ability to read, use, or benefit from it.
He called this a "big disconnect." It is also, quietly, a piece of evidence: that perfect structure, while intellectually correct, can be practically irrelevant.
Validate on Departure, Ignore on Arrival
One of the sharpest observations in the talk concerns who validates data — and when.
The traditional model: every consumer validates the file against the same schema as the producer. This sounds rigorous. It's a trap. The moment you need to update a DTD, every downstream consumer must update simultaneously. Innovation freezes. A single change requires coordinating dozens of teams.
Slide 8 — The "blood tests" model: thorough validation before sending, selective checking on arrival
"What they should have done: create their own data model, pick from the file what they need, check whatever they like about that — but ignore everything that they don't know."
— Rob Schrauwen
The better model — which Rob called "blood tests" — has producers doing exhaustive checks before dispatch. Consumers pick only what they need and verify only that. The corpus is robust enough to survive some errors, provided you detect and repair them quickly. Perfection at every node isn't the goal; resilience of the whole is.
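A minimal sketch of the consumer side of that model, with invented field names: take only what you need, check only that, and never reject a record for carrying fields you do not recognize.

# Check the fields this product actually uses; ignore everything else.
REQUIRED = {"doi", "title"}
OPTIONAL = {"authors"}

def accept(record: dict) -> dict | None:
    """Pick what we need from an incoming record. Unknown keys are
    ignored; a missing required field flags the record for repair
    instead of halting the pipeline."""
    if not REQUIRED <= record.keys():
        return None
    return {k: record[k] for k in REQUIRED | OPTIONAL if k in record}

incoming = {"doi": "10.1234/x", "title": "A title", "field-added-in-2031": "ignored"}
assert accept(incoming) == {"doi": "10.1234/x", "title": "A title"}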
Modeling the Data, Not the World
Data modelers make a subtle but fatal mistake when they model how they think the world should be, rather than how the data actually is.
Slide 9 — What "affiliations" and "references" actually look like in the wild
Rob showed examples from real published articles. An affiliation field containing "Associate Professor of Pathology" — not an institution at all, but a job title. Another: "Departments of Medicine and Cardiology in India." Another still: a street address. In a reference footnote, a single run-on sentence containing three separate citations packed together.
The data modeler's instinct: these are errors. Clean them up. Force them into the schema. Rob's counter:
"I don't model the words, I model the data. And this is what people do, so let them. True that these are two departments, but irrelevant that we have to split it. And that is all thanks to the fact that we don't structure, but that we annotate."
— Rob Schrauwen
With stand-off annotation, you can point at "Departments of Medicine and Cardiology in India" and assert: two organizations are named here. You never touch the original text. The author's words remain exactly as they wrote them. The semantic content — the two departments, the country, the relationship — lives in the annotation layer, precise and queryable, while the document layer retains its honest messiness.
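In code, such an assertion might look like the sketch below: a record that points into the untouched affiliation string by character offsets and layers the two organizations on top. The structure and the parser identifier are invented for illustration.

# The document layer: the author's words, exactly as written.
affiliation = "Departments of Medicine and Cardiology in India"

# The annotation layer: precise, queryable claims that point at the
# text from outside instead of rewriting it.
annotation = {
    "target": {"field": "affiliation", "start": 0, "end": len(affiliation)},
    "assertions": [
        {"type": "Organization", "name": "Department of Medicine"},
        {"type": "Organization", "name": "Department of Cardiology"},
        {"type": "Country", "name": "India"},
    ],
    "source": "affiliation-parser-v1",  # provenance: whose reading this is
}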
The Ocean and the Waves
Rob offered a metaphor worth keeping. New articles flowing in are the waves: dramatic, visible, consuming everyone's attention. Underneath is four kilometers of water mass — the corpus of 100 million articles that already exists.
Most data systems are designed for the waves. They process each new item as it arrives. But the ocean matters equally. When a new data point is introduced — a new way of identifying research topics, a new entity resolution algorithm — it's only valuable if it's consistent across the entire corpus, not just the new arrivals.
"A new data point can only be good if it is useful across the corpus. If we are so happy that we have it in this one article, it makes no sense to me if it is not consistently done."
— Rob Schrauwen
This led to one of his more elegant inventions: reproducible identifiers. A paragraph's ID is computed as a hash of its content. If the paragraph changes by even a comma, the ID changes, and all annotations pointing to it are invalidated. If it stays the same, annotations survive across revisions. The system catches exactly what changed and reprocesses only what must be reprocessed.
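A minimal sketch of such an identifier in Python. The talk did not name a hash function, so SHA-256 (truncated here for readability) is an assumption.

import hashlib

def paragraph_id(text: str) -> str:
    """Compute a paragraph's ID from its content, so the same text
    always yields the same ID across revisions."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

p1 = paragraph_id("The sky is blue.")
p2 = paragraph_id("The sky is blue,")  # a single comma changed
assert p1 != p2  # every annotation anchored to p1 is now known to be stale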
Embrace Two-Column Journals
Rob saved his most deliberately provocative point for last: "Let's embrace two-column journals." He paused. "I am saying this to make people slightly unhappy."
Slide 11 — A ScienceDirect article at full screen: almost entirely white space, and users prefer it that way
He showed a screenshot of a ScienceDirect article expanded to full monitor width. It was almost entirely white. And after twenty years of experimentation, thousands of consumer interactions, and every possible layout variation, the product team had concluded: this is what users want. A fixed, comfortable reading width. Not fluid reflow. Not responsive columns that reshape themselves. A settled, human-scaled line length — which is, essentially, what a print column provides.
If you accept a fixed column width, half the rendering problems disappear. Line-breaking in math becomes deterministic. HTML and PDF can converge. The entire campaign for fluid, responsive article layout was, in his view, solving the wrong problem. Two-column wasn't the enemy. Reflow was.
The Answer: CP/LD
After thirty years of reflections and a tour through ideas that were true but irrelevant, Rob proposed a concrete path forward. Not another XML standard. Not a revision to the DTD. Something structurally different: Content Profiles / Linked Data (CP/LD), an official NISO standard published in 2023.
Slide 13 — CP/LD: HTML5 as canvas, JSON-LD annotations, SHACL for validation
CP/LD has two components. A "canvas" in HTML5 — a cleanly rendered web page, without over-engineered nested divs, that just displays the article as it should look. And all semantic content encoded as annotations in JSON-LD, validated using SHACL schemas.
"In a SHACL schema, you can make SPARQL queries within the graph. Now that is so beautiful. You can do a SPARQL query — do all the destinations really have a label? Yes. Then we can push it onto Crossref."
— Rob Schrauwen
This solves the validation problem that DTDs could never crack. Instead of checking local structure, SPARQL queries inside SHACL schemas can traverse the entire graph and ask relational questions. Does every cross-reference target have a label? Do all author entities link to verified organizations? These questions are impossible in a DTD. They're natural in SHACL.
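A runnable sketch of that idea, assuming the rdflib and pyshacl Python libraries and an invented vocabulary: a SHACL shape whose constraint is a SPARQL query asking whether every cross-reference destination has a label.

from rdflib import Graph
from pyshacl import validate

shapes_ttl = '''
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .

ex:CrossRefShape a sh:NodeShape ;
    sh:targetClass ex:CrossReference ;
    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:message "Cross-reference destination has no label." ;
        sh:select """
            SELECT $this WHERE {
                $this <http://example.org/target> ?dest .
                FILTER NOT EXISTS {
                    ?dest <http://www.w3.org/2000/01/rdf-schema#label> ?l
                }
            }""" ;
    ] .
'''

data_ttl = '''
@prefix ex: <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:ref1 a ex:CrossReference ; ex:target ex:figure1 .
ex:figure1 rdfs:label "Figure 1" .
ex:ref2 a ex:CrossReference ; ex:target ex:equation2 .  # unlabeled
'''

conforms, _, report = validate(
    Graph().parse(data=data_ttl, format="turtle"),
    shacl_graph=Graph().parse(data=shapes_ttl, format="turtle"),
)
print(conforms)  # False: ex:ref2 points at a destination with no label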
Slide 14 — What CP/LD solves; what remains genuinely uncertain
Rob was candid about the hard questions. Can the industry afford to abandon the guarantees XML DTDs provide? Is the tooling ecosystem as mature? How deeply embedded is XML in existing publishing pipelines? These aren't rhetorical — they're genuine obstacles.
But he closed with something that stuck: sometimes the value of a step change isn't technical. It's psychological.
"The moment you make a step change, people suddenly feel more free to think about new ideas. Whereas if you stick with the old schema and make adjustments, people apply all the old conventions. When you change, people suddenly feel more free."
— Rob Schrauwen
What a Six-Year-Old Knew
Rob Schrauwen came to Thiruvananthapuram having spent more years thinking about content standards than most of his audience had been alive. He'd built the thing, broken it, rebuilt it, watched ideas fail in slow motion over decades, and arrived with thirty years of reflections that he delivered with the self-deprecating candor of someone who'd learned more from failure than from success.
The granddaughter story wasn't a digression. It was the proof.
Emily understood, instinctively, that data management is about trust. Trust built through transparent methodology, clear policy, unique identifiers, and the courage to admit uncertainty rather than guess. She got 7/8 recall. Nobody doubted her. The teacher believed the data.
Elsevier spent thirty years aiming for perfect structure and got something far more complicated. The lesson Rob drew: precision and recall across the corpus are everything. Provenance at every step. Stay true to the source. And the tag that actually tags — the sticky note placed on the outside of the document — is more powerful than the angle bracket that tries to contain it from within.
He ended not with a grand conclusion but with an invitation: come find me during the next three days. Tell me if what I described doesn't already exist. Prove me wrong. I'm ready to hear it.
That, too, is a kind of data quality principle: staying permanently open to revision.
30 years of content standards at Elsevier, distilled to eight lessons
01
Good ideas can be true and irrelevant at the same time
Separation of content and presentation, universal republishing, XML-first everything — all correct in principle, all unused in practice. Being right about the future doesn't mean the future will use what you built for it.
02
Annotations beat markup
A sticky note outside the document beats angle brackets inside it. Stand-off annotation supports simultaneous multiple opinions, overlapping claims, absent data, and multiple contributors — things inline markup fundamentally cannot do.
03
Model the data, not the world
An affiliation field that says "Associate Professor of Pathology" is not an error — it's what was submitted. Accept it as data; annotate with precision on the outside. Never force real-world messiness into an idealized schema.
04
Validate on departure, ignore on arrival
Producers should do exhaustive "blood tests" before sending. Consumers should pick only what they need and ignore everything else. When everyone validates the same schema, innovation freezes and a single DTD change becomes a cross-team crisis.
05
The corpus matters more than any single article
A new data point is only valuable if it's consistent across 100 million articles, not just the latest ones. Design for the ocean beneath the waves. Precision and recall across the whole corpus are everything.
06
Stay totally true to the source
Silently "fixing" data that seems wrong destroys provenance. If page 38 in the article disagrees with PubMed's page 37, don't replace it — record both, with source. You don't know which is right, and you may never know.
07
Step changes beat incremental modernization
Tweaking an old schema keeps all the old mental models in place. A genuine break — like CP/LD — creates psychological space for genuinely new thinking. Sometimes the real value of a new standard is what it allows people to un-assume.
08
Trust comes from transparency, not from perfection
Emily's family tree had 7/8 recall. One grandmother was missing. Nobody doubted the data. Trust is built through clear policy, consistent identifiers, honest provenance, and focus — not through being right about everything.