Daniel van Strien

Machine Learning Librarian at Hugging Face

Articles & links

Got a digitised collection that needs OCR? uv-scripts is a set of single-file Python scripts that OCR a whole image dataset to markdown in one command — 20+ open VLMs to pick from, nothing to install but uv. github.com/davanstrien/...

GitHub - davanstrien/uv-scripts-for-ai: Self-contained UV scripts for data & ML tasks — OCR, vision, audio & more — run one in a command, locally or on Hugging Face Jobs. Built for humans and agents. github.com
View on Bluesky · ♥ 38 ↻ 13 ↩ 1 · 2 from the directory shared this · 8d ago

You can now run SQL over 2.19 BILLION web pages — zero download. @commoncrawl.bsky.social April 2026 crawl + URL index are on Hugging Face Storage Buckets. DuckDB reads it straight over hf:// — I counted all 2.19B in ~35s. Or point your own agent at it 👇 huggingface.co/spaces/…

huggingface.co
View on Bluesky · ♥ 51 ↻ 10 ↩ 1 · 27d ago
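The counting query described in that post could look roughly like the sketch below. It assumes the URL index is stored as Parquet files, and the `hf://` location is a hypothetical placeholder because the post does not give the real bucket path. `build_count_query` and `count_rows` are illustrative names, not part of DuckDB or of the author's code.

```python
def build_count_query(parquet_glob: str) -> str:
    """Build a SQL statement that counts rows across a set of Parquet files."""
    # Double any single quotes so the path stays a valid SQL string literal.
    escaped = parquet_glob.replace("'", "''")
    return f"SELECT count(*) AS pages FROM read_parquet('{escaped}')"


def count_rows(parquet_glob: str) -> int:
    """Run the count with DuckDB, reading the files remotely instead of downloading them."""
    import duckdb  # imported lazily so the query builder works without DuckDB installed

    con = duckdb.connect()
    return con.execute(build_count_query(parquet_glob)).fetchone()[0]


# Hypothetical placeholder: substitute the actual location of the index files.
EXAMPLE_GLOB = "hf://<path-to-url-index>/*.parquet"
print(build_count_query(EXAMPLE_GLOB))
```

Calling `count_rows` with a real path needs network access and the `duckdb` package. Only the query string is printed here.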

Zero-setup UV script, runs on HF Jobs: huggingface.co/datasets/uv-... Example: 49 NLS Advocates Library cards → JSON: huggingface.co/datasets/dav... Model by NuMind: huggingface.co/numind/NuExt...

huggingface.co
View on Bluesky · ♥ 3 ↻ 0 ↩ 0 · 29d ago

What is "AI for libraries" beyond a catalogue chatbot? IMO: design patterns (OCR, extraction, classification, search), and agents that both run them and develop the small models behind them. As part of work with @natlibscot.bsky.social started a book on this: danielvanstrien.x…

AI Design Patterns for Information Professionals danielvanstrien.xyz
View on Bluesky · ♥ 45 ↻ 12 ↩ 6 · 2 from the directory shared this · 31d ago

Built a useful GLAM model this week: flags blank index cards before expensive OCR. Cheap, scales, adapts. IMO the technical barrier is now very low, the main one is knowing how. So I wrote this up as a chapter: danielvanstrien.xyz/ai-patterns-for-glam/patterns/index-card-class…

danielvanstrien.xyz
View on Bluesky · ♥ 17 ↻ 5 ↩ 1 · 23d ago
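The pattern in that post, a cheap check that filters out blank cards before the expensive OCR step, can be sketched as below. This is not the author's model: `toy_is_blank` is a hypothetical stand-in that counts dark pixels, where the post describes a trained classifier. The card IDs and thresholds are made up for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Card = Sequence[int]  # greyscale pixel values, 0 (black) to 255 (white)


def toy_is_blank(card: Card, dark_below: int = 128, max_dark_fraction: float = 0.01) -> bool:
    """Stand-in for a trained classifier: a card is blank if almost no pixels are dark."""
    if not card:
        return True
    dark = sum(1 for pixel in card if pixel < dark_below)
    return dark / len(card) <= max_dark_fraction


def route_cards(
    cards: Dict[str, Card], is_blank: Callable[[Card], bool]
) -> Tuple[List[str], List[str]]:
    """Return (ids to send to OCR, ids skipped as blank)."""
    to_ocr: List[str] = []
    skipped: List[str] = []
    for card_id, pixels in cards.items():
        (skipped if is_blank(pixels) else to_ocr).append(card_id)
    return to_ocr, skipped


cards = {
    "card-001": [255] * 100,             # all white: treated as blank
    "card-002": [255] * 80 + [20] * 20,  # 20% dark pixels: has writing
}
to_ocr, skipped = route_cards(cards, toy_is_blank)
print(to_ocr, skipped)
```

Because `route_cards` takes the classifier as a callable, a real model's prediction function could be swapped in for `toy_is_blank` without changing the routing logic.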

Recent commentary

NuExtract3 (4B, Apache-2.0) does OCR *and* structured extraction. Point it at a dataset of scanned index cards + a JSON schema → clean catalog JSON. One command on Hugging Face Jobs (or skip the schema for plain Markdown OCR). Script + dataset 👇

View on Bluesky · ♥ 39 ↻ 13 ↩ 2 · 29d ago

What could a rich ecosystem of small GLAM AI models enable? IMO: cheaper, better-fitted, more robust models. Example: I extended an existing @natlibscot.bsky.social archival card detector to 4 collections to make a more generic index card detector. Took an hour or two and minimal $

View on Bluesky · ♥ 14 ↻ 1 ↩ 1 · 16d ago
