Under the hood: Recapio grabs the transcript (prioritizing manual captions over auto-generated ones) and uses an LLM to generate structured summaries with timestamped citations.
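To make the "manual over auto-generated" preference concrete, here is a minimal sketch in Python. The `CaptionTrack` type and `pick_track` helper are hypothetical names used for illustration only; they are not Recapio's actual code and the real track metadata may look different.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaptionTrack:
    # Hypothetical shape of a caption track; real metadata may differ.
    language: str
    is_auto_generated: bool
    text: str


def pick_track(tracks: List[CaptionTrack], language: str = "en") -> Optional[CaptionTrack]:
    """Return a manual track for the language if one exists, else an auto-generated one."""
    candidates = [t for t in tracks if t.language == language]
    if not candidates:
        return None
    # False sorts before True, so a manual track wins over an auto-generated one.
    return min(candidates, key=lambda t: t.is_auto_generated)
```

If only an auto-generated track exists for the requested language, the sketch falls back to it rather than returning nothing.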
The Engineering Challenge: The biggest headache was 'hallucination drift', where the AI summary claims a topic starts at 10:00 but it actually starts at 10:45. I solved this by splitting the transcript into chunks whose context windows overlap and forcing the model to verify each timestamp against the raw text segments before outputting the link.
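A rough sketch of that idea in Python, under a few assumptions: the transcript is a list of timestamped segments, the model returns a short quote along with its claimed timestamp, and the check runs as a post-processing step. `Segment`, `overlapping_chunks`, and `verify_timestamp` are hypothetical names, and the chunk sizes are placeholders, not the values Recapio uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    start: float  # seconds from the start of the video
    text: str


def overlapping_chunks(segments: List[Segment], size: int = 40, overlap: int = 10) -> List[List[Segment]]:
    """Split segments into windows of `size` that share `overlap` segments with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for i in range(0, len(segments), step):
        chunks.append(segments[i:i + size])
        if i + size >= len(segments):
            break  # the last window already reaches the end
    return chunks


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so minor formatting differences do not block a match.
    return " ".join(text.lower().split())


def verify_timestamp(claimed_start: float, quote: str, chunk: List[Segment]) -> Optional[float]:
    """Return the start time of the raw segment containing `quote`, or None if no segment does.

    If several segments match, the one closest to the model's claimed time is used.
    """
    needle = _normalize(quote)
    if not needle:
        return None
    matches = [seg.start for seg in chunk if needle in _normalize(seg.text)]
    if not matches:
        return None  # nothing in the raw text backs the claim, so drop the link
    return min(matches, key=lambda s: abs(s - claimed_start))
```

With this shape, a summary that claims 10:00 for a quote that only appears in the segment starting at 10:45 gets its link corrected to 645 seconds, and a quote that appears nowhere in the chunk yields no link at all.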
It’s a work in progress. I'm curious whether anyone has better strategies for handling the lack of punctuation in auto-generated YouTube captions.
While building this, I realized most YouTube transcript APIs were either overpriced or lacked good integration for LLM workflows.
So I spun out the backend as a standalone API: transcriptapi.com
The cool part is I added native MCP (Model Context Protocol) support. If you use Claude Desktop or similar agents, you can drop this in as a tool to fetch full video context directly into your chat window without copy-pasting.