arxiv.org web signal June 30th 2026

NarraBERT maps narrative structure across Dolma's 3T tokens

TL;DR

A new arXiv preprint introduces NarraBERT, a RoBERTa-based classifier, and applies it to 3 million passages from the 3-trillion-token Dolma corpus.
The framework operationalizes three narrative elements (agency, setting, and events) across 11 interpretable dimensions; the classifier is trained on 400 annotated passages.
The authors report that narrative qualities are unequally distributed across pretraining sources and topics in ways current curation practices do not measure.

A short paper on arXiv this month does something the pretraining-data conversation rarely does: it looks at the actual narrative shape of what large models eat. The team, in a preprint submitted on June 17, builds a classifier called NarraBERT and runs it across three million passages drawn from Dolma, the three-trillion-token open pretraining corpus.

The framework they use is small enough to be legible: three core narrative elements (agency, setting, and events), operationalized across eleven interpretable dimensions. They hand-annotate four hundred diverse passages, fine-tune a RoBERTa-based model on that seed set, and then run it over the three-million-passage sample to produce a derivative dataset they call NarraDolma. Both the classifier and the labeled dataset are released publicly.
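For readers who want the shape of that pipeline in concrete terms, here is a minimal Python sketch of the labeling step only, under stated assumptions. It is not the authors' code. The article does not name the eleven dimensions, so placeholder names (dim_01 through dim_11) stand in for them, and `dummy_score_fn` is a hypothetical stand-in for the fine-tuned RoBERTa-based classifier. The `Passage` record and the output layout are likewise illustrative, not the released NarraDolma schema.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

# Assumption: the article does not list the 11 dimensions, so placeholders are used.
DIMENSIONS = [f"dim_{i:02d}" for i in range(1, 12)]


@dataclass
class Passage:
    passage_id: str
    source: str  # hypothetical field: which corpus source the passage came from
    text: str


# Hypothetical interface: a batch of texts in, one list of 11 scores per text out.
ScoreFn = Callable[[list[str]], list[list[float]]]


def batched(items: Iterable[Passage], size: int) -> Iterator[list[Passage]]:
    """Yield consecutive batches of at most `size` passages."""
    batch: list[Passage] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def label_passages(
    passages: Iterable[Passage], score_fn: ScoreFn, batch_size: int = 2
) -> Iterator[dict]:
    """Run the classifier over passages and emit one labeled record per passage."""
    for batch in batched(passages, batch_size):
        rows = score_fn([p.text for p in batch])
        for passage, row in zip(batch, rows):
            if len(row) != len(DIMENSIONS):
                raise ValueError("classifier must return one score per dimension")
            record = {"passage_id": passage.passage_id, "source": passage.source}
            record.update(zip(DIMENSIONS, row))
            yield record


def dummy_score_fn(texts: list[str]) -> list[list[float]]:
    """Stand-in for the fine-tuned model: a constant 0.5 for every dimension."""
    return [[0.5] * len(DIMENSIONS) for _ in texts]


# Made-up passages, for illustration only.
sample = [
    Passage("p1", "source_a", "She opened the door and stepped into the rain."),
    Passage("p2", "source_b", "The API returns a JSON object with two fields."),
    Passage("p3", "source_a", "Years later, the town had changed."),
]
labeled = list(label_passages(sample, dummy_score_fn))
```

Swapping `dummy_score_fn` for a real model wrapper is the only change the sketch would need; the batching and record layout stay the same whether the sample is three passages or three million.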

Why this matters if you are not building a pretraining filter yourself: the quality, language, and toxicity classifiers that gate modern web corpora say nothing about whether the text is story-shaped or expository, agentic or descriptive. The authors' claim is that narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. If that holds up under scrutiny, narrative becomes a new axis curators can choose to balance, or at least audit.
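To make "audit" concrete, here is a small Python sketch of what a per-source check could look like, assuming labeled records that carry a source name and a numeric score. Everything in it is illustrative: the source names, the `agency` score column, and the values are made up, and the max-minus-min gap is just one simple way to express unevenness, not a measure taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def per_source_means(records: list[dict], dimension: str) -> dict[str, float]:
    """Average one narrative score within each source."""
    by_source: dict[str, list[float]] = defaultdict(list)
    for record in records:
        by_source[record["source"]].append(record[dimension])
    return {source: mean(values) for source, values in by_source.items()}


def spread(means: dict[str, float]) -> float:
    """Gap between the highest and lowest per-source average."""
    return max(means.values()) - min(means.values())


# Made-up toy records; "agency" is a hypothetical score column.
toy_records = [
    {"source": "fiction_forum", "agency": 0.75},
    {"source": "fiction_forum", "agency": 0.25},
    {"source": "reference_wiki", "agency": 0.25},
    {"source": "reference_wiki", "agency": 0.25},
]

means = per_source_means(toy_records, "agency")
gap = spread(means)
```

A large gap on a given score would flag a source mix that is lopsided on that dimension, which is the kind of signal a quality or toxicity filter would never surface.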

The honest caveats are visible from the abstract alone. Four hundred annotated passages is a small seed, and nothing in the abstract demonstrates that adjusting narrative composition actually improves any downstream task. What the reporting doesn't give you, because it is a preprint, is independent replication, per-dimension reliability numbers, or a head-to-head comparison with other web corpora like RedPajama or FineWeb. The forward-looking part worth watching is whether anyone takes NarraDolma off the shelf and runs the next experiment: correlating narrative composition with model behavior on narrative reasoning.

Shared on Bluesky by 3 AI experts

Maria Antoniak @mariaa.bsky.social amplified

Juan Diego Rodriguez @juand-r.bsky.social

2) Characterizing Narrative Content in Web-scale LLM Pretraining Data, by @teagrjohnson.bsky.social, @elliottash.bsky.social, @andrewpiper.bsky.social, @mariaa.bsky.social arxiv.org/abs/2606.19468 Why: annotation and…
arxiv cs.CL @arxiv-cs-cl.bsky.social: Teagan Johnson, Elliott Ash, Andrew Piper, Maria Antoniak Characterizing Narrative Content in Web-scale LLM Pretraining Data https://arxiv.o… →

Originally reported by arxiv.org

Original headline: Characterizing Narrative Content in Web-scale LLM Pretraining Data