Claude-Real-Video feeds scene-aware frames to text-only LLMs
TL;DR
- Claude-Real-Video is an MIT-licensed Python tool that uses ffmpeg scene detection and pixel-diff dedup to hand any text-only LLM keyframes, a transcript, and a manifest.
- The submitter claims a 10-minute presentation collapses from roughly 600 near-identical frames to 5 to 15 keyframes, framed as 90%+ token savings with better comprehension.
- Hacker News commenters push back that Gemini's native video handling is cheaper, at around $0.24 per hour on Gemini 3.1 flash lite, and that the frames still get sent to Anthropic once they are pasted into Claude.
A quiet utility on the Hacker News front page this week does something that seems obvious in hindsight, then does it well. Claude-Real-Video is a Python package that decodes a video into a handful of scene-change keyframes, transcribes the audio with Whisper, and hands the LLM a folder it can actually read. The pitch, in the author's own framing, is that Claude will not take a video file at all, and Gemini samples frames at a fixed interval of 1 fps by default, so fast cuts slip past. This tool picks frames where something actually changed.
The mechanics are unfussy. A single ffmpeg pass finds scene changes at a default sensitivity of 0.30, a sliding window of four frames catches pixel-diff duplicates at an 8% threshold, and a hard cap of 150 keyframes prevents runaway sampling. Input comes from yt-dlp for YouTube, Instagram and TikTok URLs, or from a local file, and the output is a set of JPGs plus transcript.txt and MANIFEST.txt. The submitter's claim on Hacker News, where the project sits at 134 points and 41 comments, is that a 10-minute presentation collapses from around 600 near-identical frames to 5 to 15 keyframes, which they describe as 90%+ token savings with better comprehension.
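To make those steps concrete, here is a minimal Python sketch of the scene-detection command and the dedup pass. It is an illustration under stated assumptions, not the project's actual code: the function names (`scene_detect_cmd`, `frame_diff`, `dedup_frames`) are hypothetical, frames are modelled as flat lists of 0-255 grayscale values rather than decoded JPGs, the 8% threshold is read as mean absolute pixel difference relative to full scale, and the four-frame window is assumed to hold the last four kept frames. The ffmpeg `select` filter and its `scene` score are real ffmpeg features, but whether the tool invokes them exactly this way is a guess.

```python
from collections import deque
from typing import Deque, Iterable, List, Sequence

Frame = Sequence[int]  # assumption: one frame = flat list of 0-255 grayscale pixels


def scene_detect_cmd(src: str, out_pattern: str, sensitivity: float = 0.30) -> List[str]:
    """Build an ffmpeg command that writes one JPG per detected scene change."""
    return [
        "ffmpeg", "-i", src,
        # select passes only frames whose scene-change score exceeds the threshold
        "-vf", f"select='gt(scene,{sensitivity})'",
        # variable-frame-rate output so skipped frames are not duplicated;
        # ffmpeg builds older than 5.1 spell this "-vsync vfr"
        "-fps_mode", "vfr",
        out_pattern,
    ]


def frame_diff(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference as a fraction of full scale (0.0 to 1.0)."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / (255 * len(a))


def dedup_frames(
    frames: Iterable[Frame],
    window: int = 4,          # compare against the last N kept frames
    threshold: float = 0.08,  # below this difference, treat as a duplicate
    max_frames: int = 150,    # hard cap on the number of keyframes
) -> List[int]:
    """Return the indices of frames that survive dedup and the cap."""
    recent: Deque[Frame] = deque(maxlen=window)
    kept: List[int] = []
    for idx, frame in enumerate(frames):
        # Drop the frame if it is nearly identical to any recently kept frame.
        if any(frame_diff(frame, prev) < threshold for prev in recent):
            continue
        kept.append(idx)
        recent.append(frame)
        if len(kept) >= max_frames:
            break
    return kept
```

The command list can be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed. In this sketch the window matters because a slide deck often returns to an earlier slide: with `window=4` a revisit within the last four kept frames is dropped, while `window=1` would keep it as a new keyframe.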
The commenter response is worth reading before you install it. One thread argues the whole approach is 'pretty terribly expensive' compared with Gemini's native video handling, and cites a figure of about $0.24 per hour on Gemini 3.1 flash lite for the same job. Another points at the obvious tension in the local-first framing: the frames get sent to Anthropic the moment you paste them into Claude. And a third notes that motion and object permanence are not things a model can infer from a set of still images, no matter how well chosen they are.
The honest caveat is that a smart-frame pipeline solves a context-economy problem, not a video-understanding problem, and it is not going to close the gap with a model trained on video end to end. What the reporting doesn't give you is a rigorous cost comparison against native video models for the same task, or numbers on how the scene detector handles ugly cases like surveillance footage or fast sports. The reason the repository is still worth a look is that the pattern is portable. Any text-only model, on any provider, gets a plausible shot at video reasoning with an MIT-licensed pip install and a working ffmpeg, and that is a useful escape hatch when the video-native option is not on the table.
Originally reported by github.com
Original headline: HN Front Page: 'Claude-Real-Video' Open-Source Shim Lets Any Text-Only LLM Watch a Video Frame-by-Frame — 133 Points, 41 Comments