TUG 2025 narrative
Rob Schrauwen: True but Irrelevant
TeX Users Group 2025, 18 July 2025
True but Irrelevant
A family tree, a knowledge graph, and the uncomfortable truth that the best standards are the ones that survive contact with reality.
Rob Schrauwen did not begin with a diagram of XML. He began with a six-year-old girl in Amsterdam, a skeptical teacher, and a drawing that claimed a child had seven grandmothers. By the time the story reached Elsevier's knowledge graph, the joke had turned into a serious argument about provenance, precision, and why a publishing system should never silently rewrite what it sees.
The page includes the sketchnote, the transcript, the companion podcast, the embedded talk video, selected slides, XML examples, and researched references from TEI, W3C, NISO, NCBI, and the scholarly knowledge-graph literature.
Sketchnote
Transcript
Primary source for the narrative and quotes.
PodcastA Malcolm Gladwell and Tim Harford style companion discussion.
ResearchStandards, provenance, and annotation references.
The drawing was the talk.
The room in Thiruvananthapuram knew it was getting a systems talk. Rob had been with Elsevier since 1991. He had been a mathematician. He had been the Chief Content Architect. He had the résumé of someone who could have spoken comfortably about DTDs, XML pipelines, and enterprise content strategy for an hour without once mentioning a child with colored pencils.
Instead he told a story about his granddaughter Emily, who at age six told her teacher that she had seven grandmothers. The teacher said that was impossible. Emily did what every good data engineer hopes users will do: she brought evidence. Her drawing had numbered grandmothers, a family tree, and a policy choice hidden inside it. Great-grandmothers counted too.
"I have seven grandmothers."
Rob's granddaughter, as retold from the transcript
The joke is that the child's drawing was not childish at all. It had a definition. It had a registry of entities. It had a unique identifier for each Oma. It even had a piece of defensive programming: when she did not know how to spell Hetty, she fell back to the generic label "Oma" instead of risking a bad fact.
That is the first large idea in the talk: data quality is not only about correctness. It is about provenance, definitions, and the courage to leave uncertainty visible when you do not know enough to tidy it away.
"We need to be able to mark up things that aren't there."
The sentence that turns a family anecdote into a content-standard manifesto
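To make the uncertainty point concrete, here is a small sketch of my own, not a slide from the talk: a toy registry in the spirit of the drawing, where every grandmother gets an identifier, the counting rule is explicit, and the one name Emily could not spell stays a visible "Oma" instead of a guessed fact.
<!-- Illustrative sketch only, not from the talk -->
<grandmothers definition="great-grandmothers count too">
  <oma xml:id="oma-1">
    <!-- a numbered entry, exactly as the drawing did it -->
    <name>Oma 1</name>
  </oma>
  <oma xml:id="oma-7">
    <!-- spelling unknown, so the generic label stays and the
         uncertainty is recorded instead of tidied away -->
    <name cert="low">Oma</name>
  </oma>
</grandmothers>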
Elsevier's graph is bigger than a file.
Rob's next move was to zoom out from the drawing to the system he actually works on. Elsevier's knowledge graph is not a neat database table. It is a constantly re-argued map of research objects: 23.8 million full-text articles from ScienceDirect, 102.3 million Scopus records, author profiles, organization profiles, patents, standards, grant awards, and more.
The language he used was almost understated, which made it stronger: the platform harvests evidence, extracts occurrences, matches and clusters entities, and then forms a context view for each product. It is a machine for turning many partial truths into something usable without pretending the truth was ever simple.
"All these things are heavily probabilistic."
The hidden philosophy of the whole pipeline
Process diagram
Evidence becomes a graph, not a guess.
The talk keeps returning to one flow: evidence arrives, entities are extracted, duplicates are merged, and products receive a context view. The process diagram above traces that logic step by step.
From markup to annotation.
Rob's central pivot was not a new file format. It was a change in posture. He said that the team had moved from markup to annotation. That means the document stays the document, while the claims about the document live beside it.
This is where the standards trail becomes important. TEI's standOff container explicitly supports linked data, contextual information, and stand-off annotations. The W3C Web Annotation Data Model does the same job on the web: a body, a target, and a relationship that can be shared across platforms.
Rob's line about wanting to make assertions about the LaTeX document rather than the XML document lands because it is funny and serious at the same time. He is not anti-XML. He is anti-collapsing everything into one fragile layer.
"Annotations are even more tagging than tagging."
His sticky-note analogy, sharpened into a design principle
Inline markup
<!-- The source file carries the claim inside it -->
<article xml:id="art-17">
  <title>True but Irrelevant</title>
  <p>
    The copy editor changed <italic>page 38</italic>
    to fit the house style.
  </p>
</article>
Stand-off annotation
<!-- The claim lives outside the source and points back to it -->
<standOff>
  <annotation xml:id="ann-38" target="#art-17" resp="#copy-editor" cert="0.98">
    <note>page 38</note>
  </annotation>
</standOff>
The difference between the two examples is almost comic until you realize how much institutional pain hides inside it. Inline markup is tidy until two people need to disagree about the same line. Stand-off annotation looks messier, but it lets one person say "author", another say "copy editor", and a third preserve the original text without flattening the argument.
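One more sketch, mine rather than the speaker's, extends the stand-off example to show exactly that disagreement: two annotators attach competing claims to the same target, and neither one touches the source.
<!-- Two opinions about the same passage, neither rewriting the source -->
<standOff>
  <annotation xml:id="ann-39" target="#art-17" resp="#author" cert="0.60">
    <note>page 38 was intentional</note>
  </annotation>
  <annotation xml:id="ann-40" target="#art-17" resp="#copy-editor" cert="0.98">
    <note>page 38 was changed to match house style</note>
  </annotation>
</standOff>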
Historical audit
The 1990s were right, and not enough.
Rob revisited the promises made when SGML and XML were still new. His verdict was not that they were false. It was that many were true but irrelevant unless they changed the workflow around them.
1. Separate content from presentation
True, but Rob points out that the hard parts never went away: tables, math, flexible naming, and display rules still forced content to remember layout.
2. Re-publish content
True in theory, but exact re-publication was rare. The same article usually showed up with a new DOI, a new context, or a new editorial life.
3. XML-first
True as discipline, but dangerous as dogma. Once a system hardens, even small changes become expensive and organizationally noisy.
"The DTD was wrong all along."
Rob's bluntest line, and the one most likely to start a hallway argument
The strongest comic detail in this section is the "emphasis 1" to "emphasis 9" anecdote. Someone had decided that bold and italic were too crude, so the DTD grew a ladder of increasingly fine-grained emphasis levels. Rob's response was simpler: interpretation can go wrong, and when it does, it tends to go wrong in the exact place where users least want surprises.
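A hedged reconstruction of that ladder, in DTD terms of my own choosing since the talk does not show the actual declarations, makes the over-specification easy to see.
<!-- Illustrative only: nine shades of emphasis that every author,
     converter, and renderer must now interpret the same way -->
<!ELEMENT emphasis1 (#PCDATA)>
<!ELEMENT emphasis2 (#PCDATA)>
<!-- ... and so on, up to ... -->
<!ELEMENT emphasis9 (#PCDATA)>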
The dog-food confession matters because it makes the talk self-inverting. The company that spent years teaching the world about XML discipline eventually found some of its own documentation easier to manage in Markdown. Not because Markdown is better for everything, but because real workflows punish overfit standards. The phrase on the slide was short: hard to write, hard to display.
Validate on departure, ignore on arrival.
This was Rob at his most operational. The producer runs the tests. The receiver picks what it needs. The recipient does not get to pretend that incoming data is perfect, and it does not get to quietly rewrite the source as if it had already been fixed.
"Schema validation alone never has been sufficient."
The reason the system needs provenance as much as it needs syntax
If that sounds obvious, it is only obvious after decades of broken content. The interesting point is the asymmetry: data should be validated before it leaves the producer, not "cleaned up" by a downstream system that no longer knows what changed.
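As an illustration of what a departure test can look like, here is a minimal Schematron sketch; the tooling choice is mine, since the talk names no specific validator. The producer runs it before the file ships, and the receiver reads what it needs instead of re-normalizing.
<!-- A producer-side departure test: run before the file leaves, not after it arrives -->
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="article">
      <assert test="@xml:id">Every article carries a stable identifier.</assert>
      <assert test="title">Every article carries a title.</assert>
    </rule>
  </pattern>
</schema>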
Model the data, not the world.
Slide 9 was one of the most elegant in the deck because it reduced a large philosophy to three small corrections:
Affiliation strings aren't affiliations
An affiliation string is a representation. The institution is a modeled thing with context, identity, and history.
Reference strings aren't references
A citation string is not the same as a fully resolved, provenance-bearing bibliographic object.
Bylines aren't authors
Names on a page are useful evidence, but they are still evidence, not the final truth.
That is why Rob kept insisting on a common vocabulary and on the ability to represent absence. It is also why he kept returning to time, because content that changes without a record of change ceases to be content and becomes rumor.
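A minimal sketch of that distinction, mine and not Elsevier's actual schema: the string from the page stays verbatim as evidence, the modeled institution sits behind an identifier, and the match carries its own confidence rather than pretending to be a fact.
<!-- Illustrative only: the string is evidence, the institution is the modeled thing -->
<affiliationEvidence source="#art-17">
  <string>Department of Physics, Example University</string>
  <match ref="https://example.org/org/example-university" confidence="0.93"/>
</affiliationEvidence>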
Articles are not alone.
The platform view becomes more interesting once you realize that content changes outside the article itself: copyright notices move on separate workflows, authors revise, copy editors intervene, and the corpus keeps growing. Rob's answer was to optimize for revision resiliency, reproducible identifiers, and corpus-wide consistency rather than fetishizing a single pristine file.
"The whole database matters."
Rob's reminder that the unit of value is the corpus, not the page
That is also where the phrase "precision and recall" stops being academic jargon. When the thing you are building has 100 million records, small error rates are no longer small. They are a business model, a trust problem, and a UX problem.
Selected slides
The slide deck is the skeleton.
These slides carry the argument in compressed form.
Rob did not answer the question of what comes next with a moonshot. He answered it with a standard: Content Profiles / Linked Data (CP/LD). The important thing about CP/LD is not that it sounds modern. It is that it makes the separation he keeps talking about concrete: content, semantics, and display can be linked without collapsing into one giant schema.
That conclusion lines up with the standards literature. The RDF 1.1 Concepts and Abstract Syntax document defines RDF as a graph data model. JSON-LD 1.1 gives you a JSON-friendly way to serialize linked data. SHACL gives you a way to validate shapes. That stack is very close to Rob's "clear vocabulary" plus "language for validation" wish list.
Provenance matters too. W3C's PROV-DM explicitly defines provenance as information about entities, activities, and people involved in producing a thing, which is almost exactly the intuition Rob was applying when he refused to let a downstream system silently overwrite source data.
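To show how those pieces can sit together, here is a small RDF/XML sketch of my own; the serialization choice is mine and the identifiers are placeholders. An annotation becomes a graph node, its author is attached with the PROV vocabulary, and the time of the claim travels with it.
<!-- Illustrative only: a claim as a graph node with provenance attached -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:prov="http://www.w3.org/ns/prov#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="https://example.org/annotation/ann-38">
    <dcterms:description>page 38 changed to match house style</dcterms:description>
    <prov:wasAttributedTo rdf:resource="https://example.org/agent/copy-editor"/>
    <prov:generatedAtTime
        rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-07-18T00:00:00Z</prov:generatedAtTime>
  </rdf:Description>
</rdf:RDF>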
The closest thing to a counterweight is not a contradiction, but a caution. NISO's own CP/LD material says the standard complements existing models instead of replacing them. That is the adult version of Rob's argument: the future is not a clean break from text; it is a layered system where the source stays visible and the claims about it become portable.
"The text must be extremely high quality."
His answer to the AI question, and the reason corpus curation still matters
The audience's question about AI made the point sharper. If large language models are going to sit in front of the scholarly corpus, then the original text becomes more valuable, not less. Rob's second answer was almost managerial, but it was also strategic: the knowledge graphs are what let Elsevier do something distinctive with those models. The graph is not an ornament. It is the memory.
His multilingual answer followed the same pattern. He did not promise magical universal formatting. He said the problem is almost impossible if you try to generate text from tags for every permutation, and that the better move is to separate content from annotations so the textual form can stay textual.
Research links
Standards that support Rob's direction.
My inference from those sources is simple: Rob is right that the next layer should preserve the source and attach meaning around it, but the broader standards ecosystem also suggests that annotation works best as a companion layer, not a replacement fantasy. That is the useful middle ground.
Takeaways
What stayed after the applause.
The talk is about XML only on the surface. Underneath, it is about trust, layered meaning, and the right place to put a claim.
1. Separate content from annotation.
That separation is what allows multiple opinions, provenance, and overlap to coexist without rewriting the source.
2. Never silently normalize away the source.
Rob's insistence on staying true to the source is a practical trust policy, not a purity test.
3. Model the data, not the world.
Strings are evidence. The modeled object is the thing you actually need to reason about.
4. The corpus is the unit.
At Elsevier scale, small errors matter because the value comes from consistency across the whole graph.
5. AI increases the premium on source quality.
If the front door is an LLM, the underlying text has to be cleaner, richer, and better grounded than ever.
6. Real change is a step change.
Rob's answer to standards drift is not fiddling around the edges. It is a new layer with enough clarity to reset habits.