ASR output arrives in ~600 ms chunks and is messy (filler words, homophones, skipped phrases). A simple substring match breaks immediately.
The current tracker uses:
- inverted token index to find candidate windows - banded Levenshtein distance for fuzzy matching - Double Metaphone phonetic normalization - locality penalties to stay near the current position
Between ASR updates the UI speculatively advances the cursor based on measured WPM so the highlight moves smoothly.
Curious if anyone here has worked on similar real-time alignment problems.