Daniel van Strien

Machine Learning Librarian at Hugging Face

Machine Learning Librarian at @hf.co

Articles & links

You can now run SQL over 2.19 BILLION web pages — zero download. @commoncrawl.bsky.social April 2026 crawl + URL index are on Hugging Face Storage Buckets. DuckDB reads it straight over hf:// — I counted all 2.19B in ~35s. Or point your own agent at it 👇 huggingface.co/spaces/…

huggingface.co
View on Bluesky · ♥ 51 ↻ 10 ↩ 1 · 7d ago

Zero-setup UV script, runs on HF Jobs: huggingface.co/datasets/uv-... Example: 49 NLS Advocates Library cards → JSON: huggingface.co/datasets/dav... Model by NuMind: huggingface.co/numind/NuExt...

huggingface.co
View on Bluesky · ♥ 3 ↻ 0 ↩ 0 · 9d ago

What is "AI for libraries" beyond a catalogue chatbot? IMO: design patterns (OCR, extraction, classification, search), and agents that both run them and develop the small models behind them. As part of work with @natlibscot.bsky.social started a book on this: danielvanstrien.x…

danielvanstrien.xyz
View on Bluesky · ♥ 45 ↻ 12 ↩ 6 · 2 from the directory shared this · 11d ago

Built a useful GLAM model this week: flags blank index cards before expensive OCR. Cheap, scales, adapts. IMO the technical barrier is now very low; the main one is knowing how. So I wrote this up as a chapter: danielvanstrien.xyz/ai-patterns-for-glam/patterns/index-card-class…

danielvanstrien.xyz
View on Bluesky · ♥ 16 ↻ 5 ↩ 1 · 3d ago

Recent commentary

NuExtract3 (4B, Apache-2.0) does OCR *and* structured extraction. Point it at a dataset of scanned index cards + a JSON schema → clean catalog JSON. One command on Hugging Face Jobs (or skip the schema for plain Markdown OCR). Script + dataset 👇

View on Bluesky · ♥ 39 ↻ 13 ↩ 2 · 9d ago