My regular cycle is to talk informally to the CLI agent and ask it to “say back to me what you understood”, and it almost always produces a nice clean and clear version. This simultaneously works as confirmation of its understanding and also as a sort of spec which likely helps keep the agent on track.
UPDATE - just tried Handy with Parakeet v3, and it works really well too, so I'll use this instead of VoiceInk for a few days. I also just discovered that turning on the "debug" UI with Cmd-Shift-D shows additional options like post-processing and appending a trailing space.
I want to be able to say things like "cd ~/projects" or "git push --force".
Likewise "cd home slash projects" into "cd ~/projects".
Maybe with some fine-tuning, maybe without.
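A crude version of that mapping is just a substitution pass over the transcript before it gets typed out. A minimal sketch, with made-up rules (this isn't anything Handy or VoiceInk ships with):

```python
import re

# Hypothetical spoken-phrase -> shell-syntax rules; order matters.
RULES = [
    (r"\bhome slash\b", "~/"),
    (r"\bdash dash force\b", "--force"),
    (r"\bslash\b", "/"),
]

def to_shell(transcript: str) -> str:
    """Rewrite a dictated phrase into something closer to a shell command."""
    text = transcript.lower()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    # Close up the gap a rule like "home slash" leaves: "~/ projects" -> "~/projects"
    return re.sub(r"/\s+", "/", text)

print(to_shell("cd home slash projects"))    # cd ~/projects
print(to_shell("git push dash dash force"))  # git push --force
```

An LLM post-processing pass could presumably do the same thing more flexibly, without maintaining a rule list.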
I do, however, wonder if there is a way all these STT tools can get to the next level. The generated text should not be just a verbatim copy of what I said; depending on the context, it should elaborate. For example, if my cursor is actively inside an editor/IDE with some code, my coding-related verbal prompts should actually generate the right/desired code in that IDE.
Perhaps this is a bit of combining STT with computer-use.
I have a claude skill `/record` that runs the CLI which starts a new recording. I debug, research, etc., then say "finito" (or choose your own stopword). It outputs a markdown file with your transcribed speech interleaved with screenshots and text that you copied. You can say other keywords like "marco" and it will take a screenshot hands-free.
When the session ends, claude reads the timeline (e.g. looks at screenshots) and gets to work.
I can clean it up and push to github if anyone would get use out of it.
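If it helps to picture it before I clean it up: the core loop is roughly this shape. The file paths, function names, and helpers below are made up for illustration, not the actual skill:

```python
from datetime import datetime
from pathlib import Path

TIMELINE = Path("session.md")  # hypothetical output file the agent reads later

def append(line: str) -> None:
    with TIMELINE.open("a") as f:
        f.write(line + "\n")

def take_screenshot() -> Path:
    """Stub: the real thing would shell out to a screenshot tool here."""
    return Path(f"shot-{datetime.now():%H%M%S}.png")

def handle_segment(text: str) -> bool:
    """Add one transcribed segment to the timeline; return False on the stopword."""
    lowered = text.lower()
    if "finito" in lowered:
        return False  # end of session, hand the markdown file to the agent
    if "marco" in lowered:
        append(f"![screenshot]({take_screenshot()})")
    else:
        append(f"{datetime.now():%H:%M:%S} - {text}")
    return True
```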
I initially had a ton of keyboard shortcuts in Handy for myself when I had a broken finger and was in a cast. It let me play with the simplest form of this contextual thing, as shortcuts could effectively be mapped to certain apps with very clear use cases.
There’s also more recent-ish research, like https://dl.acm.org/doi/fullHtml/10.1145/3571884.3597130
That CLI bit I mentioned earlier is already possible. For instance, on macOS there’s an app called MacWhisper that can send dictation output to an OpenAI‑compatible endpoint.
It uses a 'character-typing' method instead of clipboard injection, so it's compatible with pretty much anything remote. I also kept it super lightweight (<50MB RAM) for Windows users who don't want to run a full local server stack.
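For anyone curious what "OpenAI-compatible endpoint" means in practice: it's basically anything that answers the /v1/chat/completions shape, so a tiny local cleanup proxy is enough. A rough sketch assuming FastAPI (I haven't checked MacWhisper's exact payload, and clean() is just a stand-in for a real LLM call):

```python
# Minimal sketch of an OpenAI-compatible endpoint a dictation app could point at.
# It returns a lightly cleaned transcript; swap clean() for a real LLM call.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    messages: list[Message]
    model: str = "local-cleanup"

def clean(text: str) -> str:
    # Placeholder post-processing: strip filler words. A local LLM would go here.
    return " ".join(w for w in text.split() if w.lower().strip(",.") not in {"um", "uh"})

@app.post("/v1/chat/completions")
def chat(req: ChatRequest):
    transcript = req.messages[-1].content
    return {
        "object": "chat.completion",
        "model": req.model,
        "choices": [{
            "index": 0,
            "finish_reason": "stop",
            "message": {"role": "assistant", "content": clean(transcript)},
        }],
    }
```

Run it with uvicorn and point the dictation app's base URL at the local server, the same way you would point it at any other OpenAI-compatible provider.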
Cool to see Handy using the newer models—local voice tech is finally getting good.
P.S. The post-processing you're talking about: wouldn't that be awesome?
There is also Post Processing where you can rerun the output through an LLM and refine it, which is the closest to what Wispr Flow is doing.
This can be found in the debug menu in the GUI (Cmd + Shift + D).
https://github.com/cjpais/Handy/actions/runs/21025848728
There is also LLM post-processing, which can do this, and the built-in dictionary feature.
Superwhisper — Been using it a long time. It's paid with a lifetime subscription available. Tons of features. Language models are built right in without additional charge. Solo dev is epic; may defer upgrades to avoid occasional bugs/regressions (hey, it's complex software).
Trying each for a few minutes:
Hex — Feels like the leanest (& cleanest) free option mentioned for Mac in this thread.
Fluid Voice — Offers a unique feature: a real-time view of your speech as you talk! Superwhisper has this, but only with an online model. (You can't see your entire transcript in Fluid, though. The recording window view is limited to about one sentence at a time; of course you do see everything when you complete your dictation.)
Handy — Pink and cute. I like the history window. As far as clipboard handling goes, I might note that the "don't modify clipboard" setting is more of a "restore clipboard" setting. Though it doesn't need as many permissions as Hex because it's willing to move clipboard items around a bit, if I'm not mistaken.
Note Hex seems to be upset about me installing all the others... lots of restarting in between installs all around. Each has something to offer.
---
Big shout out to Nvidia for open-sourcing Parakeet; all of these apps are lightning fast.
Also I'm partial to being able to stream transcriptions to the cursor in any field, or at least view them live like Fluid (or Superwhisper online). I know it's complex because models transcribe the whole file for accuracy. (I'm OK with seeing a lower-quality transcript in real time and waiting a second for the higher-quality version to paste at the end.)
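The two-pass idea in that last parenthetical is simple to sketch, at least at the plumbing level. fast_partial() and accurate_full() below are stand-ins for a streaming model and a whole-file model; show_preview() and paste() stand in for the live view and the final paste step:

```python
def transcribe_session(audio_chunks, fast_partial, accurate_full, show_preview, paste):
    """Show a rough transcript live, then paste one accurate pass at the end."""
    buffered = []
    for chunk in audio_chunks:                # e.g. ~1 s of raw samples at a time
        buffered.append(chunk)
        show_preview(fast_partial(chunk))     # low-latency, lower-quality text
    paste(accurate_full(b"".join(buffered)))  # single high-quality pass over the whole recording
```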
Handy's first release was June 2025, OpenWhispr's a month later. Handy has ~11k GitHub stars, OpenWhispr has ~730.
I built OW because I was tired of paying for WisprFlow. I'd say it is more flexible by design: Whisper.cpp (CPU + GPU) for super fast local transcription, Parakeet in progress, local or cloud LLMs for cleanup (Qwen, Mistral, Gemini, Anthropic, OpenAI, Groq etc.), and bring-your-own API keys!
Handy is more streamlined for sure!
Would love any feedback :)
Handy's UI is so clean and minimalistic that you always know what to do or where to go. Yes, it lacks some advanced features, but honestly, I've been using it for two months now and I've never looked back or searched for any other STT app.
The UI is well thought out, with just the right amount of settings for my usage.
Incredible!
Btw, do you know what the "discharging the model" setting does? It's set to "never" by default; I tried to check if it has an impact on RAM or CPU, but it doesn't seem to do anything.
How have your computing habits changed as a result of having this? When do you typically use this instead of typing on the keyboard?
If so, there should be a "keep microphone on" or similar setting in the config that may help with this. Alternatively, I set my microphone to my MacBook mic so that my headphones aren't involved at all and there is much less latency on activation.
On a Mac I definitely recommend using the internal mic, even if you're wearing AirPods.
- you're not a native speaker or have an accent
- you're using the AirPods mic
- your surroundings are noisy
- you use novel words like 'claude code'
- you mumble a bit
I did find the project's "user-facing" home page [1], which was nice. I found it rather hard to find a link from that to the code on GitHub, which was surprising.
They also have a voice-input-only version if you'd still like to keep your typing keyboard: https://voiceinput.futo.org/
Is there any way to execute commands directly on Linux?
Also, a feature to edit or correct already-typed text would be really great.
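On the first question: nothing I know of built into these apps, but if the transcript ends up on stdout or the clipboard you can wire it up yourself. A rough sketch assuming X11 with xdotool installed (on Wayland you'd reach for ydotool or wtype instead); the "run ..." prefix is just a made-up convention:

```python
import subprocess
import sys

def type_text(text: str) -> None:
    """Inject the text as keystrokes into whatever window has focus."""
    subprocess.run(["xdotool", "type", "--delay", "20", text], check=True)

def run_command(text: str) -> None:
    """Actually execute the dictated text as a shell command (use with care)."""
    subprocess.run(text, shell=True, check=False)

if __name__ == "__main__":
    transcript = sys.stdin.read().strip()
    if transcript.startswith("run "):   # say "run ..." to execute instead of type
        run_command(transcript[len("run "):])
    else:
        type_text(transcript)
```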
I know many people hate sites like this, but I actually like them for these use cases. You can get a quick, LLM-generated overview of the architecture, e.g. here: https://codewiki.google/github.com/cjpais/handy
This can already be done via a local LLM processing the text, but surely there is an easier way to do this, right?