My main gripe with Wispr Flow is that it's slow and does the entire transcription in one pass after you finish speaking. Does this stream and transcribe as you talk?
I really want to see the transcription in progress while I'm speaking.
The issues I see are:

- Transcription models use beam search to choose the most likely words at each step, taking the surrounding words into account. Accuracy drops a lot if you commit to each top word individually as it's spoken; the surrounding context matters a lot (there's a toy comparison below this list).
- To that point, transcription models do get things wrong (e.g. "best" instead of "test"). LLM post-processing can help here by taking in the top-N hypotheses from the transcription model and determining which makes the most sense (e.g. "run the tests", not "run the bests"), adding another layer of semantic understanding. Again, the surrounding context really matters (sketched in the second example below).
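To make the first point concrete, here's a toy greedy-vs-beam comparison. The `MODEL` table and its probabilities are invented purely for illustration; a real decoder scores tokens from the audio plus the decoded-so-far prefix:

```python
import math

# Toy conditional next-token table: prefix -> {token: P(token | prefix)}.
# The numbers are made up to illustrate the effect.
MODEL = {
    (): {"best": 0.6, "test": 0.4},           # "best" wins locally...
    ("best",): {"idea": 0.5, "<eos>": 0.5},
    ("test",): {"suite": 0.9, "<eos>": 0.1},  # ...but "test suite" wins overall.
}

def greedy():
    """Commit to the single most likely token at every step (word-by-word streaming)."""
    prefix, logp = (), 0.0
    while prefix in MODEL:
        token, p = max(MODEL[prefix].items(), key=lambda kv: kv[1])
        prefix, logp = prefix + (token,), logp + math.log(p)
    return prefix, logp

def beam_search(width=2):
    """Keep the `width` best partial hypotheses alive until all of them finish."""
    beams = [((), 0.0)]
    while any(prefix in MODEL for prefix, _ in beams):
        candidates = []
        for prefix, logp in beams:
            if prefix not in MODEL:           # finished hypothesis carries over
                candidates.append((prefix, logp))
                continue
            for token, p in MODEL[prefix].items():
                candidates.append((prefix + (token,), logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0]

print(greedy())       # (('best', 'idea'), ...)  p = 0.30, the locally greedy path
print(beam_search())  # (('test', 'suite'), ...) p = 0.36, the globally better path
```

With width 2, the beam recovers "test suite" (p = 0.36) even though "best" looked better at the first step (0.6 vs 0.4); a word-by-word stream would have locked in "best" and never looked back.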
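And a minimal sketch of the second point, rescoring the top-N hypotheses with an LLM. The prompt shape and the `complete` callable are assumptions for illustration, not this app's actual pipeline:

```python
# Hypothetical sketch: `complete` stands in for whatever chat/completion
# call you have available.
def pick_best_hypothesis(hypotheses: list[str], complete) -> str:
    """Ask an LLM which candidate transcript makes the most semantic sense."""
    numbered = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
    prompt = (
        "These are candidate transcripts of the same utterance, ranked by "
        "acoustic score. Reply with only the number of the one that reads "
        "as the most plausible English:\n" + numbered
    )
    reply = complete(prompt)  # e.g. "2"
    return hypotheses[int(reply.strip()) - 1]

# pick_best_hypothesis(["run the bests", "run the tests"], complete=my_llm)
# -> "run the tests"
```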
Do you need each word to stream individually? Or would it be sufficient for short phrases to stream?
The MLX inference is fast enough that you could accomplish something like the latter by releasing and re-pressing the shortcut every 5-10 words. It's so fast it honestly feels like streaming. In practice, I tend to do something like this anyway, because I find it easier to review shorter transcripts!