You want to precompute the contact sheets and serve them to users. You can encode them with VP9, mux them into the IVF format, and use the WebCodecs API to decode them in the browser (roughly 2000-3000 bytes per 240x135 frame, so ~3MB/hour for a thumbnail every 4 seconds). Alternatively, you can build the contact sheets as JPEGs, but there are dimension restrictions, reflow is slightly fiddly, and that approach doesn't exploit inter-frame compression.
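A minimal sketch of that decode path, assuming a standard 32-byte IVF header, profile-0 VP9, a codec string of "vp09.00.10.08", and a made-up renderContactSheet() helper that paints each decoded thumbnail into a canvas grid:

```typescript
// Sketch only: fetch an IVF file, feed each VP9 frame to a WebCodecs
// VideoDecoder, and draw the decoded thumbnails onto one big canvas.
async function renderContactSheet(url: string, canvas: HTMLCanvasElement) {
  const buf = new Uint8Array(await (await fetch(url)).arrayBuffer());
  const view = new DataView(buf.buffer);

  // IVF file header: 'DKIF' signature, version, header size, FourCC, width, height, ...
  if (String.fromCharCode(...buf.subarray(0, 4)) !== "DKIF") {
    throw new Error("not an IVF file");
  }
  const headerSize = view.getUint16(6, true);
  const width = view.getUint16(12, true);
  const height = view.getUint16(14, true);

  const ctx = canvas.getContext("2d")!;
  const cols = Math.max(1, Math.floor(canvas.width / width));
  let index = 0;

  const decoder = new VideoDecoder({
    output: (frame) => {
      // Lay thumbnails out left-to-right, top-to-bottom.
      ctx.drawImage(frame, (index % cols) * width, Math.floor(index / cols) * height, width, height);
      frame.close();
      index++;
    },
    error: (e) => console.error(e),
  });
  decoder.configure({
    codec: "vp09.00.10.08", // assumed: profile 0, level 1.0, 8-bit
    codedWidth: width,
    codedHeight: height,
  });

  // Per-frame IVF record: 4-byte size, 8-byte timestamp, then the VP9 payload.
  let offset = headerSize;
  while (offset + 12 <= buf.length) {
    const size = view.getUint32(offset, true);
    const timestamp = Number(view.getBigUint64(offset + 4, true));
    const data = buf.subarray(offset + 12, offset + 12 + size);
    // Simplified profile-0 keyframe test: show_existing_frame and frame_type bits both zero.
    const isKey = (data[0] & 0x08) === 0 && (data[0] & 0x04) === 0;
    decoder.decode(new EncodedVideoChunk({
      type: isKey ? "key" : "delta",
      timestamp, // raw IVF timestamp; rescale to microseconds if you need real times
      data,
    }));
    offset += 12 + size;
  }
  await decoder.flush();
  decoder.close();
}
```

The IVF framing is simple enough (a fixed 32-byte header plus a 12-byte record per frame) that no real demuxer is needed.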
I made a simple Python/Flask utility for lossless cutting that uses this approach to present a giant contact sheet for quickly selecting the portions of a video to extract.
I purposely pivoted to 100% client-side extraction to achieve zero server load and a one-line integration. While it has limits with massive data, the 'plug-and-play' nature is the core value of VAM-Seek. I'd rather give people a tool they can use in 5 seconds than a high-performance system that requires 5 minutes of server config.
Doesn't that mean the client has to pull down a bunch of extra data when it first opens the page, at least once the user calls up the seek feature? You effectively have to fetch frames from all over the video to generate the initial batch. It seems like it would make more sense to have server-side thumbnails here, as long as they're reasonably sparse and low quality.
Although I admit that the one-line client-side integration is quite compelling.
To be honest, I struggled a lot with how to build this. I have deep respect for professional craftsmanship, yet I chose a path built on close collaboration with AI.
I wrote down my internal conflict and the journey of how VAM-Seek came to be in this personal log. I’d be honored if you could read it and see what I was feeling during the process: https://haasiy.main.jp/note/blog/llm-coding-journey.html
It’s just a record of one developer trying to find a way forward.
To answer your question: VAM-Seek doesn't pre-render the entire 60 minutes. It only extracts frames for the visible grid (e.g., 24-48 thumbnails) by seeking the video element and drawing each frame to a Canvas, so it rides on the browser's hardware-accelerated decoding.
On older hardware, the bottleneck is usually the browser's video seeking speed, not the generation itself. Even on a 2012 desktop, it should populate the grid in a few seconds. If it takes longer... well, that might be your PC's way of asking for a retirement plan! ;)
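For anyone curious what that looks like in practice, here is a rough sketch of the seek-and-draw approach described above, assuming a hidden <video> element; the helper names are illustrative, not VAM-Seek's actual API:

```typescript
// Sketch only: seek a (hidden) video element to each visible cell's time and
// copy the decoded frame onto that cell's small canvas. Assumes the video's
// metadata has already loaded.
function seekTo(video: HTMLVideoElement, t: number): Promise<void> {
  return new Promise((resolve) => {
    video.addEventListener("seeked", () => resolve(), { once: true });
    video.currentTime = t;
  });
}

async function fillVisibleGrid(
  video: HTMLVideoElement,
  cells: HTMLCanvasElement[], // only the thumbnails currently on screen
  startTime: number,          // timestamp of the first visible cell, in seconds
  step: number                // seconds between thumbnails
) {
  for (let i = 0; i < cells.length; i++) {
    await seekTo(video, startTime + i * step); // seek latency is the bottleneck
    const ctx = cells[i].getContext("2d")!;
    ctx.drawImage(video, 0, 0, cells[i].width, cells[i].height);
  }
}
```

Since a single video element can only sit at one position at a time, the seeks run sequentially, which is why filling only the visible 24-48 cells (rather than the whole hour) matters so much for responsiveness.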
However, the execution is meh. The UX is terrible (on mobile at least), and the code and documentation are an overly verbose mess. The entire project ought to fit in the size of the AI-generated README. Using AI for exploration and prototyping is fine, but you can't ship that slop, mate; you need to do the polishing yourself.
Besides, improving the signal-to-noise ratio of your project actually helps with “shipping the next feature”, since LLMs themselves get lost in the noise they make.
Finally, if you want people to use your project, you need to show them that it's better than what they can make by themselves. And that's especially true now that AI reduces the cost of building new stuff. If you can't work with Claude to build something better than what Claude builds, your project isn't worth more than its token count.
My role was to architect the bridge between UI/UX design and the underlying video data processing. Handling frame extraction via Canvas, managing memory, and ensuring a seamless seek experience without any backend support requires a deep understanding of how these layers interact.
Simply connecting a backend to a UI might be common, but eliminating the backend entirely while maintaining the utility is a high-level engineering choice. AI was my hammer, but I was the one who designed the bridge. To say this is worth no more than its token count ignores the most difficult part: the intent and the structural simplification that makes it usable for others in a single line of code.
Ironic.