Too bad that tool no longer seems to be developed. Looking for something similar. But it's really nice to see what's possible with local models.
By "any GPU" you mean a physical, dedicated GPU card, right?
That's not a small requirement, especially on Macs.
https://news.ycombinator.com/item?id=46640855
I find the model works surprisingly well and, in my opinion, surpasses every other model I've tried. Finally a model that can mostly understand my not-so-perfect English and handle language switching mid-sentence (compare that to Gemini's voice input, which is literally THE WORST: it keeps trying to transcribe in the wrong language, and even when the language is correct it produces utter garbage).
The top feature is the per-app custom settings: you can pick different models and instructions for different apps and websites.
- I use the fast Parakeet model when working with Claude Code (VS Code app).
- I use a smarter one when drafting notes in Obsidian, with a prompt that cleans up my rambling and formats the result as proper Markdown. Very convenient.
One more cool thing is that it lets me use LLMs with audio input modalities directly (not just as text post-processing): e.g., it sends the audio to Gemini and prompts it to transcribe, format, etc. in one pass. I find it a bit slow for working with Claude Code, but it is the absolute best in terms of accuracy, understanding, and formatting. It is the only model I trust to understand what I meant and produce the correct result, even when I use multiple languages, tech terms, etc.
But a few weeks ago someone on HN pointed me to Hex, which also supports Parakeet v3 and, incredibly enough, is even faster than Handy because it's a native macOS-only app that leverages CoreML and the Neural Engine for extremely quick transcriptions. Long ramblings transcribed in under a second!
It's now my favorite fully local STT for macOS:
My comment on this from a month back: https://news.ycombinator.com/item?id=46637040
I think the biggest difference between FreeFlow and Handy is that FreeFlow implements what Monologue calls "deep context", where it post-processes the raw transcription with context from your currently open window.
This fixes misspelled names if you're replying to an email, makes sure technical terms are spelled right, and so on.
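The exact prompt FreeFlow uses is visible in its Run Logs, but the idea can be illustrated with a hypothetical prompt builder. The wording below is mine, not FreeFlow's; it just shows how window context lets the post-processing LLM correct spellings it could never recover from audio alone:

```python
def build_cleanup_prompt(raw_transcript: str, window_context: str) -> str:
    # Hypothetical prompt combining raw STT output with text scraped from
    # the active window, so the LLM can fix names and jargon it would
    # otherwise have no way to spell correctly.
    return (
        "You are cleaning up a speech-to-text transcript.\n"
        "Context from the user's active window (use it to correct names, "
        "technical terms, and spellings, but do not quote it):\n"
        f"---\n{window_context}\n---\n"
        f"Raw transcript:\n{raw_transcript}\n"
        "Return only the corrected transcript."
    )
```

The context string might be the subject and sender of the email being replied to, or the visible text of the active editor tab.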
The original hope for FreeFlow was for it to use all local models like Handy does, but with the post-processing step the pipeline took 5-10 seconds instead of <1 second with Groq.
Thank you for making Handy! It looks amazing and I wish I found it before making FreeFlow.
You can go to Settings > Run Logs in FreeFlow to see the full pipeline run on each request, with the exact prompt and LLM response, so you know exactly what was sent and returned.
Surprisingly, it produced a better output (at least I liked its version) than the recommended but heavy model (Parakeet V3 @ 478 MB).
F12 -> sox for recording -> temp.wav -> faster-whisper -> pbcopy -> notify-send to know what’s happening
https://github.com/sathish316/soupawhisper
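That pipeline is simple enough to sketch in a page of Python. A hedged sketch, not the actual soupawhisper code: the function names and model choice are illustrative, and it assumes `sox` and the `faster-whisper` package are installed:

```python
import subprocess

def record_cmd(out_wav: str, rate: int = 16000) -> list[str]:
    # sox records mono 16 kHz audio from the default mic; a hotkey handler
    # would start this on F12 press and kill it on release
    return ["sox", "-d", "-r", str(rate), "-c", "1", out_wav]

def transcribe(wav_path: str) -> str:
    # faster-whisper runs Whisper locally (via CTranslate2, quantized)
    from faster_whisper import WhisperModel
    model = WhisperModel("base", compute_type="int8")
    segments, _info = model.transcribe(wav_path)
    return " ".join(seg.text.strip() for seg in segments)

def main() -> None:
    # In practice this is wired to the F12 hotkey; shown linearly here.
    wav = "/tmp/temp.wav"
    recorder = subprocess.Popen(record_cmd(wav))
    recorder.wait()  # returns once the recorder process is stopped
    text = transcribe(wav)
    # pbcopy puts the transcript on the macOS clipboard
    subprocess.run(["pbcopy"], input=text.encode(), check=True)
    # notify-send (or a macOS notification equivalent) signals completion
    subprocess.run(["notify-send", "soupawhisper", text[:60]])
```

Paste wherever you like once the notification fires; the transcript is on the clipboard.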
I found a Linux version with a similar workflow and forked it to build the Mac version. It took less than 15 minutes to ask Claude to modify it to my needs.
F12 Press → arecord (ALSA) → temp.wav → faster-whisper → xclip + xdotool
https://github.com/ksred/soupawhisper
Thanks to faster-whisper and local models using quantization, I use it in all places where I was previously using Superwhisper in Docs, Terminal etc.
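On the Linux side, only the delivery step differs: instead of `pbcopy`, the transcript goes through `xclip` and `xdotool`. A sketch of just that step (illustrative, not the fork's actual code; assumes both tools are installed and an X11 session):

```python
import subprocess

def xclip_copy(text: str) -> None:
    # xclip puts the transcript on the X11 clipboard
    subprocess.run(["xclip", "-selection", "clipboard"],
                   input=text.encode(), check=True)

def xdotool_type_cmd(text: str) -> list[str]:
    # xdotool types the transcript straight into the focused window,
    # so it works even in apps where pasting is awkward (e.g. terminals)
    return ["xdotool", "type", "--clearmodifiers", text]

def deliver(text: str) -> None:
    xclip_copy(text)  # keep a copy on the clipboard as well
    subprocess.run(xdotool_type_cmd(text), check=True)
```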
Edit: Ah, but Parakeet I think isn't available for free. Still a very worthwhile single-purchase app nonetheless!
I had previously used Hyprnote to record meetings this way, and indeed I still use it as a backup since it's a great free option, but the prompt to record when a meeting starts and the better transcription offered by MacWhisper make for a much better experience.
Been a power user of SuperWhisper and Wispr Flow for a long time and eventually decided to unify those flows: memos and dictations, everything is a file, local-first, BYOK.
If you are willing to use a service for transcriptions, Mistral (which is also European) works rather nicely if they support your language https://docs.mistral.ai/capabilities/audio_transcription#tra...
https://github.com/kitlangton/Hex
for me it strikes the balance of good, fast, and cheap for everyday transcription. macwhisper is overkill, superwhisper too clever, and handy too buggy. hex fits just right for me (so far)
I built https://github.com/bwarzecha/Axii to keep EVERYTHING local and be fully open source, so it can easily be used at any company. No data is sent anywhere.
Chatterbox TTS (from Resemble AI) does the voice generation, WhisperX gives word-level timestamps so you can click any word to jump, and FastAPI ties it all together with SSE streaming so audio starts playing before the whole thing is done generating.
There's a ~5s buffer up front while the first chunk generates, but after that each chunk streams in faster than realtime. So playback rarely stalls.
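The chunked-streaming idea can be sketched without the real stack. Below, `synthesize_chunk` is a hypothetical stand-in for the Chatterbox call, and the generator yields SSE frames that FastAPI's `StreamingResponse` (or any SSE server) could forward; base64 is one simple way to carry binary audio in a text event:

```python
import base64
from typing import Callable, Iterator

def sentence_chunks(text: str) -> list[str]:
    # Split on sentence boundaries so each TTS call gets a natural unit;
    # smaller chunks mean the first audio arrives sooner.
    parts, buf = [], ""
    for ch in text:
        buf += ch
        if ch in ".!?":
            parts.append(buf.strip())
            buf = ""
    if buf.strip():
        parts.append(buf.strip())
    return parts

def sse_audio_stream(text: str,
                     synthesize_chunk: Callable[[str], bytes]) -> Iterator[str]:
    # Yield one SSE event per synthesized chunk: the client can start
    # playback on the first event instead of waiting for full generation.
    for chunk in sentence_chunks(text):
        audio = synthesize_chunk(chunk)  # e.g. the TTS model, per chunk
        payload = base64.b64encode(audio).decode()
        yield f"data: {payload}\n\n"
    yield "data: [DONE]\n\n"
```

As long as each chunk synthesizes faster than it takes to play, the only user-visible wait is the first chunk, which matches the ~5s up-front buffer described above.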
It took about 4 hours today... wild.
And then I set the button right below that as the Enter key, so it feels mostly hands-off the keyboard.
My take for X11 Linux systems. Small and low-dependency, apart from the model download.
https://github.com/corlinp/voibe
I do see the name has since been taken by a paid service... shame.
If you do that, the total pipeline takes too long for the UX to be good (5-10 seconds per transcription instead of <1s). I also had concerns around battery life.
Some day!
It’s free and offline
Just use handy: https://github.com/cjpais/Handy
The native app uses Parakeet (v2 or v3) on iOS.
Won't be free when xAI starts charging.