Derived datasets are a bigger share of the Hugging Face Hub than people realise. ~73% of analysed datasets on the Hub are derivatives of something else (cleaned, translated, extended, etc.). I built an explorer that infers the missing lineage from content: huggingface.co/spaces/davan...
Daniel van Strien
Machine Learning Librarian at Hugging Face
Articles & links
You can now run SQL over 2.19 BILLION web pages — zero download. @commoncrawl.bsky.social April 2026 crawl + URL index are on Hugging Face Storage Buckets. DuckDB reads it straight over hf:// — I counted all 2.19B in ~35s. Or point your own agent at it 👇 huggingface.co/spaces/…
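A minimal sketch of what "DuckDB reads it straight over hf://" can look like from Python, assuming the `duckdb` package is installed and the data is stored as Parquet. The path below is a hypothetical placeholder, not the real location of the crawl index (the post's link is truncated), and whether Storage Buckets use the same `hf://` path layout as dataset repos is an assumption here. The helper names `build_count_query` and `count_rows` are illustrative only.

```python
def build_count_query(parquet_glob: str) -> str:
    """Return a SQL string that counts rows across all Parquet files matching the glob."""
    # Escape single quotes so the path stays a valid SQL string literal.
    safe = parquet_glob.replace("'", "''")
    return f"SELECT count(*) AS n FROM read_parquet('{safe}')"


def count_rows(parquet_glob: str) -> int:
    """Run the count remotely; needs the duckdb package and network access."""
    import duckdb  # imported lazily so the query builder works without it

    con = duckdb.connect()  # in-memory database, nothing is downloaded to disk
    return con.execute(build_count_query(parquet_glob)).fetchone()[0]


# Hypothetical placeholder path (assumption), not the bucket from the post.
EXAMPLE_GLOB = "hf://datasets/<user>/<repo>/**/*.parquet"
print(build_count_query(EXAMPLE_GLOB))
```

Calling `count_rows(...)` with a real path would run the query against the remote files. A plain row count like this can often be answered from Parquet metadata rather than a full scan, which may be why it finishes quickly at this scale.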
Zero-setup UV script, runs on HF Jobs: huggingface.co/datasets/uv-... Example: 49 NLS Advocates Library cards → JSON: huggingface.co/datasets/dav... Model by NuMind: huggingface.co/numind/NuExt...
What is "AI for libraries" beyond a catalogue chatbot? IMO: design patterns (OCR, extraction, classification, search), and agents that both run them and develop the small models behind them. As part of work with @natlibscot.bsky.social started a book on this: danielvanstrien.x…
Built a useful GLAM model this week: it flags blank index cards before expensive OCR. Cheap, scales, adapts. IMO the technical barrier is now very low; the main barrier is knowing how. So I wrote this up as a chapter: danielvanstrien.xyz/ai-patterns-for-glam/patterns/index-card-class…
Recent commentary
NuExtract3 (4B, Apache-2.0) does OCR *and* structured extraction. Point it at a dataset of scanned index cards + a JSON schema → clean catalog JSON. One command on Hugging Face Jobs. (Or skip the schema for plain Markdown OCR.) Script + dataset 👇