In July 2025, in the heavy morning warmth of Thiruvananthapuram, the opening talk at the TeX User Group annual conference began with a crayon drawing.
Not a polished diagram. Not a carefully curated slide. A hand-drawn family tree on crumpled paper, made by a six-year-old girl named Emily, who had a problem to solve. She had told her teacher she had seven grandmothers. The teacher declared it impossible. Emily went home and built a proof.
Rob Schrauwen had been with Elsevier since 1991. He was trained as a mathematician. He'd spent three decades as Chief Content Architect, then VP of Data and Platform Strategy, overseeing the infrastructure that processes and connects tens of millions of scientific articles. When he stood up to open TUG 2025, he pulled out a photograph of a child's drawing and said: this is what I want to talk about.
"She produced this — a drawing that my granddaughter made when she was six. We can learn a lot from it."
— Rob Schrauwen
The drawing was, Rob explained, a masterpiece of knowledge graph design. Emily had established a clear policy: great-grandmothers count as grandmothers. She'd assigned each grandmother a unique circled number — a unique "Oma ID." She'd included men in the picture purely as structural evidence, because a person can only be a grandmother if they are the mother of someone. And she'd focused ruthlessly on the task: only grandmothers are numbered. Nothing else.
Then came the detail that stopped the room. When Emily wasn't certain whether "Hetty" was spelled with a "y" or with "ie," she simply wrote "Oma" — reverting to the generic label rather than risk introducing a spelling error that would undermine her entire case.
"She basically invented continuous data quality or defensive programming. If you're not sure about the facts, you revert to the generic fact. Because if she would have misspelled Hetty, the teacher would have said, 'Oh, this whole story is nonsense because you can't even spell the name of your grandmother.'"
— Rob Schrauwen
The punchline? Emily actually had eight grandmothers. One — her own — was missing from the tree, even though Rob himself appears in the drawing, standing right next to "Oma 1." The recall was seven-eighths. Imperfect.
And yet nobody doubted the data. The trust was complete. Trust built not through infallibility, but through transparency of method, clarity of policy, and total focus on the problem at hand.
Rob had spent thirty years trying to build exactly this at one of the world's largest scientific publishers. Here it was, in purple crayon, by a six-year-old.
The Machine Behind One Hundred Million Articles
To understand what Rob was trying to say, you need to understand what Elsevier's data platform actually does. Because the scale of it is staggering even by the standards of big tech.
Elsevier's knowledge graph contains 23.8 million full-text articles from ScienceDirect, 102.3 million records from Scopus, 40 million researcher profiles, and 100,000 carefully curated organizations — all linked together. Not just knowledge about research, but knowledge in research: the ideas, methods, findings, and connections that live inside the papers themselves.
How Elsevier's data platform works — five steps, a hundred million articles, all probabilistic
The pipeline that generates all this runs in five steps: harvest evidence about entities, extract occurrences of those entities, match and cluster them, form a context view for each product, and make the datasets available. The critical word Rob used again and again: probabilistic. The machine doesn't know for certain. It estimates, with provenance attached.
That insistence on provenance — on recording not just what you know but how you know it — is what makes the system trustworthy. And it's what leads directly to Rob's most fundamental argument.
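To make that concrete, here is a minimal Python sketch of the five steps. Everything in it is illustrative (the names, the toy record, and the confidence values are invented, not Elsevier's actual code); the point is that each record carries a source and a confidence score rather than a bare fact.

from dataclasses import dataclass

@dataclass
class Assertion:
    """One probabilistic claim, with its provenance attached."""
    claim: dict
    source: str
    confidence: float

# 1. Harvest: collect raw evidence about entities.
evidence = [{"text": "M. Curie", "article": "doi:10.1234/example", "field": "author"}]

# 2. Extract: turn the evidence into entity occurrences.
occurrences = [Assertion(claim={"surface": e["text"]},
                         source=e["article"], confidence=1.0)
               for e in evidence]

# 3. Match and cluster: link occurrences to candidate entities,
#    probabilistically, never with certainty.
matched = [Assertion(claim={"surface": o.claim["surface"],
                            "person": "author:marie-curie"},
                     source="matcher-v3", confidence=0.92)
           for o in occurrences]

# 4. Context view and 5. publish: each product selects the view it
#    needs, for example only claims above a confidence threshold.
dataset = [a for a in matched if a.confidence > 0.8]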
"It is extremely important that we stay totally true to the source. If some page number says 38 in the article, and it is 37 in PubMed, and maybe we replace it by 37 — I think with that, we have already ruined the entire knowledge graph."
— Rob Schrauwen
Vendors had been doing this "helpfully" for years: silently correcting data that seemed wrong. Rob's position: that's a catastrophe. You don't know that PubMed is right. You don't know that the difference isn't meaningful. Stay true to the source. If you must annotate, do it outside the document and mark it clearly as your assertion, not the author's.
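One way to picture "annotate outside the document": a stand-off note that records PubMed's page number as our assertion, clearly attributed, while the article's own text stays untouched. The sketch below borrows the W3C Web Annotation vocabulary; the agent and target identifiers are invented.

import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "assessing",
    "creator": "https://example.org/agents/metadata-team",  # whose claim this is
    "target": "https://example.org/articles/123#ref-12",    # the untouched reference
    "body": {
        "type": "TextualBody",
        "value": "PubMed records pageStart as 37; the article text says 38."
    }
}
print(json.dumps(annotation, indent=2))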
From Tagging to Real Tagging
Here is where Rob landed his central insight — and where the talk earned its title.
For decades, scientific publishing has been built on markup: wrapping text in XML tags. <author>, <affiliation>, <reference>. The vocabulary evolved, the DTDs grew elaborate, and enormous effort went into encoding every element of every article into a structured hierarchy. The assumption was: get the structure right, and everything else follows.
Rob's counter: markup is the wrong tool entirely. The right tool is annotation.
"A tag that you stick onto it like a sticky note is far more like a tag than an XML tag. So I say we go from tagging to real tagging — and put a sticky note onto it."
— Rob Schrauwen
Markup, Rob explained, fails in four fundamental ways: it's unintuitive to mark up things that aren't there; it works best only when everything is in a single file; it's impossible to have overlapping markup; and you must change the content itself to fit the schema. Stand-off annotation — placing assertions outside the document — solves all four.
The Key Insight
When annotations live outside the document, multiple people can make simultaneous claims about the same text. You can record what's absent. You can make overlapping assertions. You can capture multiple conflicting opinions and weight them later. And you never touch the original.
Inline XML (what Rob called "markup")
<article>
  <author>Pierre and Marie Curie</author>
  <affiliation>Departments of Medicine
    and Cardiology in India</affiliation>
</article>
One author. One affiliation. Reality ignored in favour of schema.
Stand-off annotation (what Rob proposed)
{
  "@context": "https://schema.org",
  "@graph": [{
    "@id": "#au1",
    "@type": "Person",
    "name": "Pierre and Marie Curie",
    "alternateName": ["Pierre Curie", "Marie Curie"]
  }, {
    "@id": "#aff1",
    "@type": "Organization",
    "name": "Departments of Medicine and Cardiology",
    "location": "India",
    "department": [
      {"@type": "Organization", "name": "Department of Medicine"},
      {"@type": "Organization", "name": "Department of Cardiology"}
    ]
  }]
}
Both people recorded. Both departments. Provenance attached. Original text untouched.
Rob had been moving Elsevier toward this model for years, using RDF and what he called the "golden rules for data": always add provenance; respect for zero (mark up what's absent, not just what's present); anchor everything in time; use a common vocabulary; and allow multiple opinions about the same thing.
That last rule is perhaps the most important. In a markup world, there is one slot for "author" and one answer goes there. In an annotation world, you can record that the author said X, PubMed says Y, and an algorithm concluded Z — all simultaneously, with their weights and sources preserved. The corpus decides which to trust.
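A small sketch of what that last rule buys, with invented sources and weights: three simultaneous opinions about the same page number, none of them overwriting the others, plus one possible downstream policy for choosing among them.

from collections import defaultdict

# Conflicting opinions coexist, each with its source and weight.
opinions = [
    {"field": "pageStart", "value": "38", "source": "article",      "weight": 1.00},
    {"field": "pageStart", "value": "37", "source": "PubMed",       "weight": 0.85},
    {"field": "pageStart", "value": "37", "source": "algorithm-v2", "weight": 0.60},
]

# One possible policy: trust the value with the greatest combined
# weight. The record keeps every opinion either way.
totals = defaultdict(float)
for o in opinions:
    totals[o["value"]] += o["weight"]
best = max(totals, key=totals.get)  # "37" under these invented weights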
Thirty Years of Things That Were True but Irrelevant
The title of the talk wasn't merely clever. It was the organizing principle of Rob's entire career narrative — and it's more useful than it first appears.
Being "true but irrelevant" is a specific kind of failure. It's not wrong. It's not stupid. The ideas Rob described were often genuinely insightful, technically correct, and deeply considered by very smart people. They just turned out not to matter in the way anyone expected. The future that arrived was different from the future that was designed for.
The Promises of the 1990s — Did They Materialize?
Three grand ideas that were entirely correct in principle and almost entirely unused in practice.
1
Separate content from presentation
The plan: separate data from style so the same XML file could appear in a thousand forms — different journals, different layouts, different media. The reality: articles never appeared in multiple styles. "Style" and "data" became hopelessly confused. Table column widths (left entirely to typesetters) turned out to matter enormously and were never encoded. Critical metadata every user needed was never consistently present.
True, but irrelevant
2
Re-publish content everywhere
The plan: the same article file would flow into journals, books, and databases interchangeably. The reality: it never happened. Every re-appearance got its own DOI, its own citation details. A file couldn't carry the metadata one appearance required without becoming unusable for every other. "That makes these files completely useless," Rob said flatly.
True, but irrelevant
3
XML-first everything
The plan: one canonical XML file generates both the online rendering and the PDF. The reality: this required extraordinary standardization. Every journal wanted to be special. DTD changes became impossibly costly — "with the tiniest change to our DTDs, it's extremely complicated." LaTeX was quietly introduced as an exception in 2007. The XML-first ideal never became XML-only practice.
True, but irrelevant
Slide 6 — Three grand promises; three polite disappointments
The Emphasis Nine Problem
One story Rob told is quietly devastating. In the 1980s, Elsevier's predecessors built a DTD with no <bold> or <italic>, only <emphasis1> through <emphasis9>. The vision was semantic purity: if you know that E. coli is a species name, mark it as <species>, not merely italic. The system could then do intelligent things with that semantic information.
The vision was philosophically correct. The execution was practically irrelevant. No author would ever accept that their bold and italic might "swap around." The semantic markup was never used. And Rob's conclusion:
"To add an interpretation can only go wrong. It's far better if we just use italic and annotate on the side if we want to."
— Rob Schrauwen
Eating Your Own Dog Food — Or Rather, Not
Here Rob delivered the confession that got the biggest laugh of the morning. As Chief Content Architect of Elsevier, he was responsible for the Elsevier DTD. He even wrote two books about it — "tag by tag" — 444 pages each, documenting every element of the standard he'd designed.
He wrote those books in LaTeX.
Slide 7 — The Chief Content Architect wrote his documentation in LaTeX, not Elsevier XML
"I was Chief Content Architect of Elsevier, telling everybody you have to be very, very precise on the tagging. But I have never in all those 30 years coded a third-order heading in Word as a third-order heading. I just made it italic and blue, over and over again. And this worked."
— Rob Schrauwen
He published an internal newsletter for twenty-nine years. Hundreds of employees subscribed. He formatted it manually — italic, blue — every single time. No semantic markup. No proper heading structure. And not once did it interfere with anyone's ability to read, use, or benefit from it.
He called this a "big disconnect." It is also, quietly, a piece of evidence: that perfect structure, while intellectually correct, can be practically irrelevant.
Validate on Departure, Ignore on Arrival
One of the sharpest observations in the talk concerns who validates data — and when.
The traditional model: every consumer validates the file against the same schema as the producer. This sounds rigorous. It's a trap. The moment you need to update a DTD, every downstream consumer must update simultaneously. Innovation freezes. A single change requires coordinating dozens of teams.
Slide 8 — The "blood tests" model: thorough validation before sending, selective checking on arrival
"What they should have done: create their own data model, pick from the file what they need, check whatever they like about that — but ignore everything that they don't know."
— Rob Schrauwen
The better model — which Rob called "blood tests" — has producers doing exhaustive checks before dispatch. Consumers pick only what they need and verify only that. The corpus is robust enough to survive some errors, provided you detect and repair them quickly. Perfection at every node isn't the goal; resilience of the whole is.
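A minimal sketch of the consumer side of that model, with invented field names: take only what you need, check only that, and never reject a record for carrying fields you do not recognize.

# Check the fields this product actually uses; ignore everything else.
REQUIRED = {"doi", "title"}
OPTIONAL = {"authors"}

def accept(record: dict) -> dict | None:
    """Pick what we need from an incoming record. Unknown keys are
    ignored; a missing required field flags the record for repair
    instead of halting the pipeline."""
    if not REQUIRED <= record.keys():
        return None
    return {k: record[k] for k in REQUIRED | OPTIONAL if k in record}

incoming = {"doi": "10.1234/x", "title": "A title", "field-added-in-2031": "ignored"}
assert accept(incoming) == {"doi": "10.1234/x", "title": "A title"}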
Modeling the Data, Not the World
Data modelers make a subtle but fatal mistake when they model how they think the world should be, rather than how the data actually is.
Slide 9 — What "affiliations" and "references" actually look like in the wild
Rob showed examples from real published articles. An affiliation field containing "Associate Professor of Pathology" — not an institution at all, but a job title. Another: "Departments of Medicine and Cardiology in India." Another still: a street address. In a reference footnote, a single run-on sentence containing three separate citations packed together.
The data modeler's instinct: these are errors. Clean them up. Force them into the schema. Rob's counter:
"I don't model the words, I model the data. And this is what people do, so let them. True that these are two departments, but irrelevant that we have to split it. And that is all thanks to the fact that we don't structure, but that we annotate."
— Rob Schrauwen
With stand-off annotation, you can point at "Departments of Medicine and Cardiology in India" and assert: two organizations are named here. You never touch the original text. The author's words remain exactly as they wrote them. The semantic content — the two departments, the country, the relationship — lives in the annotation layer, precise and queryable, while the document layer retains its honest messiness.
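In code, such an assertion might look like the sketch below: a record that points into the untouched affiliation string by character offsets and layers the two organizations on top. The structure and the parser identifier are invented for illustration.

# The document layer: the author's words, exactly as written.
affiliation = "Departments of Medicine and Cardiology in India"

# The annotation layer: precise, queryable claims that point at the
# text from outside instead of rewriting it.
annotation = {
    "target": {"field": "affiliation", "start": 0, "end": len(affiliation)},
    "assertions": [
        {"type": "Organization", "name": "Department of Medicine"},
        {"type": "Organization", "name": "Department of Cardiology"},
        {"type": "Country", "name": "India"},
    ],
    "source": "affiliation-parser-v1",  # provenance: whose reading this is
}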
The Ocean and the Waves
Rob offered a metaphor worth keeping. New articles flowing in are the waves: dramatic, visible, consuming everyone's attention. Underneath is four kilometers of water mass — the corpus of 100 million articles that already exists.
Most data systems are designed for the waves. They process each new item as it arrives. But the ocean matters equally. When a new data point is introduced — a new way of identifying research topics, a new entity resolution algorithm — it's only valuable if it's consistent across the entire corpus, not just the new arrivals.
"A new data point can only be good if it is useful across the corpus. If we are so happy that we have it in this one article, it makes no sense to me if it is not consistently done."
— Rob Schrauwen
This led to one of his more elegant inventions: reproducible identifiers. A paragraph's ID is computed as a hash of its content. If the paragraph changes by even a comma, the ID changes, and all annotations pointing to it are invalidated. If it stays the same, annotations survive across revisions. The system catches exactly what changed and reprocesses only what must be reprocessed.
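A minimal sketch of such an identifier in Python. The talk did not name a hash function, so SHA-256 (truncated here for readability) is an assumption.

import hashlib

def paragraph_id(text: str) -> str:
    """Compute a paragraph's ID from its content, so the same text
    always yields the same ID across revisions."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

p1 = paragraph_id("The sky is blue.")
p2 = paragraph_id("The sky is blue,")  # a single comma changed
assert p1 != p2  # every annotation anchored to p1 is now known to be stale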
Embrace Two-Column Journals
Rob saved his most deliberately provocative point for last: "Let's embrace two-column journals." He paused. "I am saying this to make people slightly unhappy."
Slide 11 — A ScienceDirect article at full screen: almost entirely white space, and users prefer it that way
He showed a screenshot of a ScienceDirect article expanded to full monitor width. It was almost entirely white. And after twenty years of experimentation, thousands of consumer interactions, and every possible layout variation, the product team had concluded: this is what users want. A fixed, comfortable reading width. Not fluid reflow. Not responsive columns that reshape themselves. A settled, human-scaled line length — which is, essentially, what a print column provides.
If you accept a fixed column width, half the rendering problems disappear. Line-breaking in math becomes deterministic. HTML and PDF can converge. The entire campaign for fluid, responsive article layout was, in his view, solving the wrong problem. Two-column wasn't the enemy. Reflow was.
The Answer: CP/LD
After thirty years of reflections and a tour through ideas that were true but irrelevant, Rob proposed a concrete path forward. Not another XML standard. Not a revision to the DTD. Something structurally different: Content Profiles / Linked Data (CP/LD), an official NISO standard published in 2023.
Slide 13 — CP/LD: HTML5 as canvas, JSON-LD annotations, SHACL for validation
CP/LD has two components. A "canvas" in HTML5 — a cleanly rendered web page, without over-engineered nested divs, that just displays the article as it should look. And all semantic content encoded as annotations in JSON-LD, validated using SHACL schemas.
"In a SHACL schema, you can make SPARQL queries within the graph. Now that is so beautiful. You can do a SPARQL query — do all the destinations really have a label? Yes. Then we can push it onto Crossref."
— Rob Schrauwen
This solves the validation problem that DTDs could never crack. Instead of checking local structure, SPARQL queries inside SHACL schemas can traverse the entire graph and ask relational questions. Does every cross-reference target have a label? Do all author entities link to verified organizations? These questions are impossible in a DTD. They're natural in SHACL.
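A runnable sketch of that idea, assuming the rdflib and pyshacl Python libraries and an invented vocabulary: a SHACL shape whose constraint is a SPARQL query asking whether every cross-reference destination has a label.

from rdflib import Graph
from pyshacl import validate

shapes_ttl = '''
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .

ex:CrossRefShape a sh:NodeShape ;
    sh:targetClass ex:CrossReference ;
    sh:sparql [
        a sh:SPARQLConstraint ;
        sh:message "Cross-reference destination has no label." ;
        sh:select """
            SELECT $this WHERE {
                $this <http://example.org/target> ?dest .
                FILTER NOT EXISTS {
                    ?dest <http://www.w3.org/2000/01/rdf-schema#label> ?l
                }
            }""" ;
    ] .
'''

data_ttl = '''
@prefix ex: <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:ref1 a ex:CrossReference ; ex:target ex:figure1 .
ex:figure1 rdfs:label "Figure 1" .
ex:ref2 a ex:CrossReference ; ex:target ex:equation2 .  # unlabeled
'''

conforms, _, report = validate(
    Graph().parse(data=data_ttl, format="turtle"),
    shacl_graph=Graph().parse(data=shapes_ttl, format="turtle"),
)
print(conforms)  # False: ex:ref2 points at a destination with no label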
Slide 14 — What CP/LD solves; what remains genuinely uncertain
Rob was candid about the hard questions. Can the industry afford to abandon the guarantees XML DTDs provide? Is the tooling ecosystem as mature? How deeply embedded is XML in existing publishing pipelines? These aren't rhetorical — they're genuine obstacles.
But he closed with something that stuck: sometimes the value of a step change isn't technical. It's psychological.
"The moment you make a step change, people suddenly feel more free to think about new ideas. Whereas if you stick with the old schema and make adjustments, people apply all the old conventions. When you change, people suddenly feel more free."
— Rob Schrauwen
What a Six-Year-Old Knew
Rob Schrauwen came to Thiruvananthapuram having spent more years thinking about content standards than most of his audience had been alive. He'd built the thing, broken it, rebuilt it, watched ideas fail in slow motion over decades, and arrived with thirty years of reflections that he delivered with the self-deprecating candor of someone who'd learned more from failure than from success.
The granddaughter story wasn't a digression. It was the proof.
Emily understood, instinctively, that data management is about trust. Trust built through transparent methodology, clear policy, unique identifiers, and the courage to admit uncertainty rather than guess. She got 7/8 recall. Nobody doubted her. The teacher believed the data.
Elsevier spent thirty years aiming for perfect structure and got something far more complicated. The lesson Rob drew: precision and recall across the corpus are everything. Provenance at every step. Stay true to the source. And the tag that actually tags — the sticky note placed on the outside of the document — is more powerful than the angle bracket that tries to contain it from within.
He ended not with a grand conclusion but with an invitation: come find me during the next three days. Tell me if what I described doesn't already exist. Prove me wrong. I'm ready to hear it.
That, too, is a kind of data quality principle: staying permanently open to revision.
30 years of content standards at Elsevier, distilled to eight lessons
01
Good ideas can be true and irrelevant at the same time
Separation of content and presentation, universal republishing, XML-first everything — all correct in principle, all unused in practice. Being right about the future doesn't mean the future will use what you built for it.
02
Annotations beat markup
A sticky note outside the document beats angle brackets inside it. Stand-off annotation supports simultaneous multiple opinions, overlapping claims, absent data, and multiple contributors — things inline markup fundamentally cannot do.
03
Model the data, not the world
An affiliation field that says "Associate Professor of Pathology" is not an error — it's what was submitted. Accept it as data; annotate with precision on the outside. Never force real-world messiness into an idealized schema.
04
Validate on departure, ignore on arrival
Producers should do exhaustive "blood tests" before sending. Consumers should pick only what they need and ignore everything else. When everyone validates the same schema, innovation freezes and a single DTD change becomes a cross-team crisis.
05
The corpus matters more than any single article
A new data point is only valuable if it's consistent across 100 million articles, not just the latest ones. Design for the ocean beneath the waves. Precision and recall across the whole corpus are everything.
06
Stay totally true to the source
Silently "fixing" data that seems wrong destroys provenance. If page 38 in the article disagrees with PubMed's page 37, don't replace it — record both, with source. You don't know which is right, and you may never know.
07
Step changes beat incremental modernization
Tweaking an old schema keeps all the old mental models in place. A genuine break — like CP/LD — creates psychological space for genuinely new thinking. Sometimes the real value of a new standard is what it allows people to un-assume.
08
Trust comes from transparency, not from perfection
Emily's family tree had 7/8 recall. One grandmother was missing. Nobody doubted the data. Trust is built through clear policy, consistent identifiers, honest provenance, and focus — not through being right about everything.