My regular cycle is to talk informally to the CLI agent and ask it to “say back to me what you understood”, and it almost always produces a nice, clean, clear version. This simultaneously works as confirmation of its understanding and as a sort of spec, which likely helps keep the agent on track.
UPDATE: just tried Handy with Parakeet v3, and it works really well too, so I'll use this instead of VoiceInk for a few days. I also just discovered that turning on the "debug" UI with Cmd-Shift-D shows additional options like post-processing and appending a trailing space.
I do, however, wonder if there is a way all these STT tools can get to the next level. The generated text should not be just a verbatim copy of what I just said; depending on the context, it should elaborate. For example, if my cursor is actively inside an editor/IDE with some code, my coding-related verbal prompts should actually generate the desired code right in that IDE.
Perhaps this is a bit of combining STT with computer use.
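To make that concrete, here's a rough sketch of what the glue could look like: read the frontmost app, hand it to a local LLM together with the transcript, and return whatever comes out. Everything here (the Ollama-style endpoint on localhost:11434, the model name, the prompt) is an assumption for illustration, not a feature of Handy, VoiceInk, or any other tool in this thread.

```python
# Rough sketch: context-aware post-processing of a dictated transcript.
# Assumptions: the transcript arrives as plain text, an OpenAI-compatible
# LLM endpoint runs at localhost:11434, and we're on macOS so the
# frontmost app can be read via osascript.
import json
import subprocess
import urllib.request

def frontmost_app() -> str:
    """Ask macOS which application currently has focus."""
    script = ('tell application "System Events" to get name of '
              'first process whose frontmost is true')
    return subprocess.run(["osascript", "-e", script],
                          capture_output=True, text=True).stdout.strip()

def rewrite_for_context(transcript: str) -> str:
    """Send the raw transcript plus the active-app name to a local LLM."""
    app = frontmost_app()
    body = {
        "model": "llama3",  # placeholder model name
        "messages": [
            {"role": "system",
             "content": f"The user dictated text while '{app}' was focused. "
                        "If it is an IDE, turn the dictation into code; "
                        "otherwise just clean up the wording."},
            {"role": "user", "content": transcript},
        ],
    }
    req = urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(rewrite_for_context("make a function that reverses a linked list"))
```

The missing piece is the last hop: knowing not just which app is focused but where the cursor is and what surrounds it, which is where the computer-use part would come in.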
I have a Claude skill `/record` that runs the CLI, which starts a new recording. I debug, research, etc., then say "finito" (or choose your own stop word). It outputs a markdown file with the transcribed speech interleaved with screenshots and any text you copied. You can say other keywords like "marco" and it will take a screenshot hands-free.
When the session ends, Claude reads the timeline (e.g. looks at the screenshots) and gets to work.
I can clean it up and push to GitHub if anyone would get use out of it.
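In the meantime, the core of it is roughly this kind of loop. This is only a sketch: the macOS-only `screencapture` call, the file names, and reading the transcript from stdin are all assumptions, and the real skill does more than this.

```python
# Sketch of a "/record"-style timeline: read transcribed lines from stdin,
# take a screenshot when the user says "marco", stop on "finito", and write
# everything into a markdown timeline. Paths and tool choices are assumptions.
import subprocess
import sys
import time
from pathlib import Path

OUT_DIR = Path("record-session")
OUT_DIR.mkdir(exist_ok=True)

def screenshot() -> Path:
    """Capture the screen with macOS's screencapture utility (-x mutes the sound)."""
    path = OUT_DIR / f"shot-{int(time.time())}.png"
    subprocess.run(["screencapture", "-x", str(path)], check=True)
    return path

def main() -> None:
    timeline = OUT_DIR / "timeline.md"
    with timeline.open("w") as md:
        md.write("# Recording session\n\n")
        for line in sys.stdin:            # one transcribed utterance per line
            text = line.strip()
            if not text:
                continue
            if "finito" in text.lower():  # stop word ends the session
                break
            if "marco" in text.lower():   # keyword triggers a screenshot
                md.write(f"![screenshot]({screenshot().name})\n\n")
                continue
            md.write(f"- {text}\n")
    print(f"Timeline written to {timeline}")

if __name__ == "__main__":
    main()
```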
There’s also more recent-ish research, like https://dl.acm.org/doi/fullHtml/10.1145/3571884.3597130
I initially had a ton of keyboard shortcuts in Handy for myself when I had a broken finger and was in a cast. It let me play with the simplest form of this contextual thing, since shortcuts could effectively be mapped to certain apps with very clear use cases.
That CLI bit I mentioned earlier is already possible. For instance, on macOS there’s an app called MacWhisper that can send dictation output to an OpenAI‑compatible endpoint.
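And the receiving side of such an endpoint doesn't have to be much. Here's a minimal sketch of an OpenAI-compatible /v1/chat/completions server a dictation app could be pointed at; the filler-word "cleanup" is a toy stand-in for a real LLM call, and none of this is from MacWhisper's docs.

```python
# Minimal sketch of an OpenAI-compatible /v1/chat/completions endpoint that a
# dictation app could be pointed at. The "cleanup" here is a toy filler-word
# filter; a real setup would forward the text to an actual LLM instead.
import time
from flask import Flask, jsonify, request

app = Flask(__name__)
FILLERS = {"um", "uh", "like,", "you know,"}

@app.post("/v1/chat/completions")
def chat_completions():
    payload = request.get_json()
    # The dictated text arrives as the last user message.
    transcript = payload["messages"][-1]["content"]
    cleaned = " ".join(w for w in transcript.split() if w.lower() not in FILLERS)
    # Reply in the shape OpenAI-compatible clients expect.
    return jsonify({
        "id": "chatcmpl-local",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": payload.get("model", "local-cleanup"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": cleaned},
            "finish_reason": "stop",
        }],
    })

if __name__ == "__main__":
    app.run(port=8000)
```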
P.S. The post-processing you're talking about, wouldn't that be awesome?
https://github.com/cjpais/Handy/actions/runs/21025848728
There is also LLM post-processing, which can do this, and the built-in dictionary feature.
Is there any way to execute commands directly on Linux?
Also, a feature to edit or correct already-typed text would be really great.
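On the Linux-commands question above: I don't know of a built-in way, but a generic workaround is to pipe the transcribed text through a small phrase-to-command wrapper. A sketch, with a made-up phrase table and the assumption that the transcript arrives on stdin rather than through anything Handy ships:

```python
# Sketch: map spoken phrases to shell commands on Linux. Reads transcribed
# text on stdin (e.g. piped from whatever dumps your dictation to a file or
# pipe) and runs a command when a known phrase appears. Phrases are made up.
import shlex
import subprocess
import sys

PHRASE_TO_COMMAND = {
    "open a terminal": "gnome-terminal",
    "lock the screen": "loginctl lock-session",
    "what time is it": "date",
}

def main() -> None:
    for line in sys.stdin:
        spoken = line.strip().lower()
        for phrase, command in PHRASE_TO_COMMAND.items():
            if phrase in spoken:
                print(f"[voice] running: {command}")
                subprocess.run(shlex.split(command), check=False)

if __name__ == "__main__":
    main()
```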
Handy's first release was in June 2025; OpenWhispr's came a month later. Handy has ~11k GitHub stars, OpenWhispr ~730.
Handy's UI is so clean and minimalistic that you always know what to do or where to go. Yes, it lacks some advanced features, but honestly, I've been using it for two months now and I've never looked back or searched for any other STT app.
The UI is well thought out, with just the right amount of settings for my usage.
Incredible!
Btw, do you know what "discharging the model" does? It's set to "never" by default; I tried to check whether it has an impact on RAM or CPU, but it doesn't seem to do anything.
How have your computing habits changed as a result of having this? When do you typically use this instead of typing on the keyboard?
- you're not a native speaker or have an accent
- you're using the AirPods mic
- your surroundings are noisy
- you use novel words like 'claude code'
- you mumble a bit
I know many people hate sites like this, but I actually like them for these use cases. You can get a quick, LLM-generated overview of the architecture, e.g. here: https://codewiki.google/github.com/cjpais/handy
If so, there should be a "keep microphone on" or similar setting in the config that may help with this. Alternatively, I set my microphone to my MacBook mic so that my headphones aren't involved at all and there is much less latency on activation.
They also have a voice input only version if you still would like to keep your typing keyboard: https://voiceinput.futo.org/
I did find the project's "user-facing" home page [1], which was nice. I found it rather hard to find a link from that to the code on GitHub, which was surprising.
This can already be done via a local LLM processing the text, but surely there is an easier way to do this, right?