Who's Who of AI

Daniel van Strien

Machine Learning Librarian at Hugging Face

254 trust practitioner @danielvanstrien.bsky.social · 4,582 followers

AI research

Why they matter

Machine Learning Librarian at Hugging Face who publicly shares open tooling, datasets, and demos for applying AI to library, archive, and museum collections.

AI signals: 7
Sources: 2
Discussions: 1
Latest signal: 13d ago

Machine Learning Librarian at @hf.co

What they're sharing

Articles & links

Got a digitised collection that needs OCR? uv-scripts is a set of single-file Python scripts that OCR a whole image dataset to markdown in one command — 20+ open VLMs to pick from, nothing to install but uv. github.com/davanstrien/...

GitHub - davanstrien/uv-scripts-for-ai: Self-contained UV scripts for data & ML tasks — OCR, vision, audio & more — run one in a command, locally or on Hugging Face Jobs. Built for humans and agents. github.com

AI Weekly's analysis →

Each script is a self-contained Python file using PEP 723 inline dependency declarations, runnable with a single `uv run` command.
Nine task categories are covered including OCR with 30+ models, audio transcription, vision detection, embeddings, and LLM inference.
Scripts use standardized argument patterns so both humans and AI agents can run them locally or on Hugging Face Jobs GPU infrastructure.

View on Bluesky · ♥ 38 ↻ 13 ↩ 1 · 2 from the directory shared this · 48d ago

Derived datasets are bigger on Hugging Face Hub than people realise. ~73% of analysed datasets on the Hub are derivatives of something else, i.e. cleaned, translated, extended, etc. Built an explorer that infers the missing lineage from content: huggingface.co/spaces/davan...

Dataset Lineage Explorer - a Hugging Face Space by davanstrien huggingface.co

View on Bluesky · ♥ 17 ↻ 3 ↩ 1 · 2 from the directory shared this · 62d ago

You can now run SQL over 2.19 BILLION web pages — zero download. @commoncrawl.bsky.social April 2026 crawl + URL index are on Hugging Face Storage Buckets. DuckDB reads it straight over hf:// — I counted all 2.19B in ~35s. Or point your own agent at it 👇 huggingface.co/spaces/…

The April 2026 Web by the Numbers - a Hugging Face Space by davanstrien huggingface.co

View on Bluesky · ♥ 51 ↻ 10 ↩ 1 · 67d ago

Small open models are getting genuinely good at document parsing: OvisOCR2 (0.9B, Apache 2.0) is claiming SOTA on OmniDocBench v1.6. Day-1 recipe: OCR a whole HF image dataset — digitised newspapers, archives, zines — to markdown with one command. huggingface.co/datasets/uv-sc…

uv-scripts/ocr · Datasets at Hugging Face huggingface.co

View on Bluesky · ♥ 34 ↻ 2 ↩ 0 · 14d ago

Open ASR models with speaker diarization are now fast and cheap: I diarized 174 hours of Apollo 11 mission audio (the real July 1969 NASA tapes) for $9.46 with a 0.9B open model. Search it, hear any moment on the original tape: huggingface.co/spaces/davan...

Apollo 11 — Search the Mission Audio - a Hugging Face Space by davanstrien huggingface.co

View on Bluesky · ♥ 22 ↻ 1 ↩ 3 · 19d ago

It's unedited machine output over scratchy radio (it keeps hearing "Follow eleven, this is Houston"). The recipe runs on any audio collection in one command: huggingface.co/datasets/uv-...

uv-scripts/transcription · Datasets at Hugging Face huggingface.co

View on Bluesky · ♥ 2 ↻ 0 ↩ 0 · 19d ago

The transcript is a finding aid, not a replacement — every segment links back to the Internet Archive originals. And the pipeline flagged which reels are blank carrier hiss: useful collection metadata in its own right. Dataset (CC0): huggingface.co/datasets/dav...

davanstrien/apollo-11-diarized · Datasets at Hugging Face huggingface.co

View on Bluesky · ♥ 1 ↻ 0 ↩ 1 · 19d ago

Explore it in the Dataset Viewer, no code needed:
- the monthly leaderboard + how it shifts as new tools launch
- request share vs user share (a few heavy pipelines vs many light users)
- daily data: launch spikes, weekday vs weekend patterns...
huggingface.co/datasets/hug...

huggingface/agent-usage · Datasets at Hugging Face huggingface.co

View on Bluesky · ♥ 1 ↻ 0 ↩ 0 · 26d ago

You can now use 100s of tools with @hf.co Buckets, thanks to the new S3 API! Usually just one or two lines to change. huggingface.co/docs/hub/sto...

S3 Compatibility · Hugging Face huggingface.co

View on Bluesky · ♥ 6 ↻ 0 ↩ 0 · 28d ago

Model, datasets, a live demo and the full recipe are all open, built on openly shared collections. It even turns a recipe card into clean JSON! Blog: danielvanstrien.xyz/posts/2026/n... Demo: huggingface.co/spaces/small...

Index card extractor - a Hugging Face Space by small-models-for-glam huggingface.co

View on Bluesky · ♥ 9 ↻ 0 ↩ 0 · 34d ago

Demo: huggingface.co/spaces/davan...

The Recovered Page - a Hugging Face Space by davanstrien huggingface.co

View on Bluesky · ♥ 3 ↻ 0 ↩ 0 · 35d ago

Can try yourself via @hf.co Jobs. Recipe: huggingface.co/datasets/uv-... Output data: huggingface.co/datasets/dav...

davanstrien/chronicling-america-surya-ocr · Datasets at Hugging Face huggingface.co

View on Bluesky · ♥ 8 ↻ 0 ↩ 1 · 36d ago

Their own posts

Recent commentary

I think VLM-based OCR might finally be close to working on historic newspapers! Many models I've tried before failed i.e. hallucinations, repetition loops, context overflow. Surya OCR 2 (a 650M model!) does a very good job!

View on Bluesky · ♥ 87 ↻ 14 ↩ 1 · 36d ago

NuExtract3 (4B, Apache-2.0) does OCR *and* structured extraction. Point it at a dataset of scanned index cards + a JSON schema → clean catalog JSON. One command on Hugging Face Jobs (or skip the schema for plain Markdown OCR). Script + dataset 👇

View on Bluesky · ♥ 39 ↻ 13 ↩ 2 · 69d ago

If libraries, archives and museums pooled their (labelled) data, they could build state-of-the-art open models for the things they actually care about! I tried a small version: one open model (NuExtract-3, 4B) fine-tuned to read archival index cards across several collections.

View on Bluesky · ♥ 28 ↻ 5 ↩ 1 · 34d ago

What could a rich ecosystem of small GLAM AI models enable? IMO: cheaper, better-fitted, more robust models. Example: I extended an existing @natlibscot.bsky.social archival card detector to 4 collections to make a more generic index card detector. Took an hour or two and minimal $

View on Bluesky · ♥ 14 ↻ 1 ↩ 1 · 56d ago
