Processing hundreds of screenshots per hour forced us to optimize token costs.
The surprise: send video, not images.
- Single screenshot (1698×894): 1,812 tokens
- Same frame in video: 258 tokens (Gemini 2.5) or ~70 tokens (Gemini 3)
- Full 8-hour workday: ~$1-3
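
To put the per-frame numbers above in proportion, here is a small arithmetic sketch. The token counts are the ones quoted in the list; the 500-frame batch is a hypothetical figure chosen only for illustration, and the helper names are made up for this example.

```python
# Per-frame token counts quoted in the list above.
IMAGE_TOKENS = 1812            # single 1698x894 screenshot sent as an image
VIDEO_TOKENS_GEMINI_25 = 258   # same frame inside a video, Gemini 2.5
VIDEO_TOKENS_GEMINI_3 = 70     # approximate figure for Gemini 3


def savings_ratio(image_tokens: int, video_tokens: int) -> float:
    """How many times fewer tokens a video frame costs than an image."""
    return image_tokens / video_tokens


def tokens_for_frames(frame_count: int, tokens_per_frame: int) -> int:
    """Total input tokens for a batch of frames at a fixed per-frame cost."""
    return frame_count * tokens_per_frame


ratio_25 = savings_ratio(IMAGE_TOKENS, VIDEO_TOKENS_GEMINI_25)
ratio_3 = savings_ratio(IMAGE_TOKENS, VIDEO_TOKENS_GEMINI_3)
print(f"Gemini 2.5: {ratio_25:.1f}x fewer tokens per frame")
print(f"Gemini 3: about {ratio_3:.1f}x fewer tokens per frame")

# Hypothetical batch size, not a figure from the post.
HYPOTHETICAL_FRAMES = 500
as_images = tokens_for_frames(HYPOTHETICAL_FRAMES, IMAGE_TOKENS)
as_video = tokens_for_frames(HYPOTHETICAL_FRAMES, VIDEO_TOKENS_GEMINI_25)
print(f"{HYPOTHETICAL_FRAMES} frames: {as_images} tokens as images, {as_video} as video")
```

That works out to roughly 7x fewer tokens per frame on Gemini 2.5 and roughly 26x on Gemini 3.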
Video gives you timestamps for free and compresses well because consecutive frames are nearly identical. We keep costs down further by having the LLM write only short summaries and running OCR locally for text extraction.
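
One way to turn a folder of screenshots into a video is with ffmpeg. This is a minimal sketch, not necessarily the pipeline described here: the `screenshots/*.png` pattern, the `workday.mp4` output name, the one-frame-per-second rate, and the `build_ffmpeg_command` helper are all assumptions for illustration. The block only builds and prints the command; running it requires ffmpeg to be installed, and the `glob` pattern type may not be available in every ffmpeg build.

```python
import shlex


def build_ffmpeg_command(pattern: str, output: str, fps: int = 1) -> list[str]:
    """Build an ffmpeg argument list that stitches images into an H.264 video.

    The frame rate, paths, and codec choice are illustrative assumptions.
    """
    return [
        "ffmpeg",
        "-framerate", str(fps),     # input rate: one screenshot per video frame
        "-pattern_type", "glob",    # expand the shell-style wildcard in -i
        "-i", pattern,
        "-c:v", "libx264",          # inter-frame compression suits near-identical frames
        "-pix_fmt", "yuv420p",      # widely compatible pixel format
        output,
    ]


cmd = build_ffmpeg_command("screenshots/*.png", "workday.mp4")
print(shlex.join(cmd))

# To actually encode, pass the list to subprocess.run(cmd, check=True).
```

At one frame per second, the position of a frame in the video maps directly to its offset in seconds, which is one way to read "timestamps for free".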