MacWhisper [0] (the app I settled on) is conspicuously missing from your benchmarks [1]. How does it compare?
For that benchmarking table, you can use Whisper Large V3 as a stand-in for MacWhisper and Superwhisper accuracy.
Aqua looks good and I will be testing it, but I do like that with superwhisper nothing leaves my computer unless I add AI integrations.
Side-comment of something this made me think of (again): tech builds too much for tech. I've lived in the Bay before, so I know why this happens. When you're there, everyone around you is in tech, your girlfriend is in tech, you go to parties and everyone invariably ends up talking about work, which is tech. Your frustrations are with tech tools and so are your peers', so you're constantly thinking about tech solutions applicable to tech's problems.
This seems very much marketed to SF people doing SF things ("Cursor, Gmail, Slack, even your terminal"). I wonder how much effort has gone into making this work with code editors or the terminal, even though I doubt this would be a big use case for this software if it ever became generally popular. I'd imagine the market here is much larger in education, journalism, film, accessibility, even government. Those are much more exciting demos.
I share the same sentiment. I remember thinking in college how annoying it was that I was reading low-resolution, marked-up, skewed, b&w scans of a book using Adobe Acrobat while CS concentrators were doing everything in VS Code (then brand new).
but we do think voice is actually great with Cursor. It’s also really useful in the terminal for certain things. Checking out or creating branches, for example.
"we also collect and process your voice inputs [..] We leverage this data for improvements and development [..] Sharing of your information [..] service providers [..] OpenAI" https://withaqua.com/privacy
No mention of privacy (or on-prem) - so I assume it's 100% cloud.
Non-starter for me. Accuracy is important, but privacy is more so.
Hopefully a service with these capabilities will emerge where the first step has the user complete a brief training session, sends that to the cloud to tailor the recognition parameters to their voice and mannerisms... and then loads the result locally.
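Roughly, the flow I'm imagining would look something like this (the endpoint and helper names are made up purely to illustrate the architecture, nothing here is a real service):

    # Hypothetical "personalize in the cloud, run locally" flow.
    import requests

    ENROLL_URL = "https://api.example-stt.com/v1/enroll"  # made-up endpoint

    def enroll_and_fetch_profile(sample_wav: str, profile_path: str) -> None:
        """One-time step: upload a short training recording, get back a
        small speaker-adaptation profile to use with a local model."""
        with open(sample_wav, "rb") as f:
            resp = requests.post(ENROLL_URL, files={"audio": f})
        resp.raise_for_status()
        with open(profile_path, "wb") as out:
            out.write(resp.content)

    # From then on, recognition could run fully offline, e.g.:
    #   model = load_local_model("small.en", adapter=profile_path)  # hypothetical
    #   print(model.transcribe("dictation.wav"))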
given that we need the cloud, we offer zero data retention -- you can see this in the app. your concern is as much about ux and communications as it is privacy
And self-hosting real-time streaming LLMs will probably also come out at around 50 cents per hour. Arguing for a $120/month price for power users is probably going to be very difficult (at 50 cents an hour, $120 buys 240 hours of usage a month). Especially so if there are free open-source alternatives.
I’d say local is necessary for a delightful product experience, and the added bonus is that it ticks the privacy box
I do wish there was a mobile app though (or maybe an iOS keyboard). It would also be nice to be able to have a separate hotkey you can set up to send the output to a specific app (instead of just the active one).
Things I've learned are:
1. It works better if you're connected by Ethernet than by Wi-Fi.
2. It needs to have a longer recognition history because sometimes you hit the wrong key to end a recognition session, and it loses everything.
3. Besides the longer history, a debugging mode that records all the characters sent to the dictation box would be useful. Sometimes, I see one set of words, blink, and then it's replaced with a new recognition result. Capturing that would be useful for describing what went wrong.
4. There should be a way to tell us when a new version is running. Occasionally, I've run into problems where I'm getting errors, and I can't tell if it's my speaking, my audio chain, my computer, the network, or the app.
5. Grammarly is a great add-on because it helps me correct mis-speakings and odd little errors, like too many spaces caused by starting and stopping recognition.
When Dragon Systems went through bankruptcy court, a public benefit corporation bid for the core technology because it recognized that Dragon was a critical tool for people with disabilities to function in a digital world.
In my opinion, Aqua has reached a similar status as an essential tool. Well, it doesn't fully replace Dragon for those who need command and control (yet). The recognition accuracy and smoothness are so amazing that I can't envision returning to Dragon Systems without much pain. The only thing worse would be going back to a keyboard.
Aqua Guys, don't fuck it up.
But I’ve noticed/learned that I can’t dictate written content. My brain just does not work that way at all — as I write I am constantly pausing to think, to revise, etc and it feels like a completely different part of my brain is engaged. Everything I dictated with Aqua I had to throw away and rewrite.
Has anyone had similar problems, and if so, had any success retraining themselves toward dictation? There are fleeting moments where it truly feels like it would be much faster.
It's very hard, and I wouldn't do it if I didn't have to.
(which is why I'm always perplexed by these apps which allow voice dictation or voice control, but not as a complete accessibility package. I wouldn't be using my voice if my hands worked!)
It's also critically important (and after 3-4 years of this I still regularly fail at it) to actually read what you've written and edit it before sending, because those chunks don't always line up into something that I'd consider acceptably coherent. Even for a one-sentence Slack message.
(also, I have a kiwi accent, and the dictation software I use is not always perfect at getting what I wanted to say on the page)
In my experience, LLMs can be quite forgiving when given some unfinished input and asked to expand or clean it up?
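Something like this works as a quick sketch (using the official openai package; the model name is just an example, and you'd need OPENAI_API_KEY set in the environment):

    # Minimal cleanup pass: give the LLM the raw dictation and ask it
    # to tidy, not invent.
    from openai import OpenAI

    client = OpenAI()

    def clean_up(raw_dictation: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # example model
            messages=[
                {"role": "system", "content": (
                    "Clean up this dictated text: fix grammar and punctuation, "
                    "drop filler words, keep the speaker's meaning. "
                    "Do not add new content.")},
                {"role": "user", "content": raw_dictation},
            ],
        )
        return resp.choices[0].message.content

    print(clean_up("so um the deploy broke because uh we forgot the env var i think"))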
1. like you mentioned, the second I start talking about something, I totally forget where I'm going and have to pause; it's like my thoughts aren't coming to me. Probably some sort of mental feedback loop plus, like you mentioned, a different method of thinking.
2. in the back of my mind, I'm always self-conscious that someone is listening, so it's a privacy / being judged / being overheard feeling which adds a layer of mental feedback.
There also aren't great audio cues for handling on-the-fly editing. I've tried to say "parentheses word parentheses" and it just gets written out. I've tried to say "strike that" and it gets written out. These interfaces are very 'happy path' and don't do a lot of processing (on iOS, I can say "period" and get a '.' (or ?, !), but that's about the extent).
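Even a naive post-processing pass over the transcript would go a long way. A sketch of what I mean (the command vocabulary here is invented for illustration, not what any real app does):

    # Toy post-processor: turn spoken tokens like "period" into
    # punctuation, and make "strike that" erase the current clause
    # instead of being typed out literally.
    import re

    PUNCTUATION = {"period": ".", "comma": ","}

    def apply_voice_commands(transcript: str) -> str:
        tokens = transcript.lower().split()
        out = []
        i = 0
        while i < len(tokens):
            if tokens[i:i + 2] == ["strike", "that"]:
                # drop words back to the last punctuation mark
                while out and out[-1] not in PUNCTUATION.values():
                    out.pop()
                i += 2
            elif tokens[i] in PUNCTUATION:
                out.append(PUNCTUATION[tokens[i]])
                i += 1
            else:
                out.append(tokens[i])
                i += 1
        # attach punctuation to the preceding word
        return re.sub(r"\s+([.,])", r"\1", " ".join(out))

    print(apply_voice_commands("delete the file strike that keep the file period"))
    # -> "keep the file."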
I have had some success with long-form recording sessions which are transcribed afterwards. After getting over the short initial hump, I can brain-dump to the recording, and then trust an app like Voice Notes or Superwhisper to transcribe, and then clean up after.
The main issue I run into there, though, is that I either forget to record something (e.g. a conversation that I want to review later), or there's too much friction: I don't record often enough for launching it to be quick, or I don't even remember to use that workflow.
I get the same feeling with smart home stuff - it was awesome for a while to turn lights on and off with voice, but lately there's the added overhead of "did it hear me? do I need to repeat myself? What's the least amount of words I can say? Why can't I just think something into existence instead? Or have a perfect contextual interface on a physical device?"
1. The models weren't ready.
2. The interactions were often strained. Not every edit/change is easy to articulate with your voice.
If 1 had been our only problem, we might have had a hit. In reality, I think optimizing for model errors allowed us to ignore some fundamental awkwardness in the experience. We've tried to rectify this in v2 by putting less emphasis on streaming for every interaction and less emphasis on commands, replacing them with context.
Hopefully it can become a tool in the toolbox.
Voice is great whenever the limiting factor on thought is typing speed.
I can't find any documentation on how Aqua works, or how it compares, so I'm not sure it's meant to be a replacement / competitor to Talon? What are you configuring? How are you telling it that you like "genz" style in Slack? Can I create custom configurations / macros?
One thing I like about Talon is that it's not magic. Which maybe is not what you're going for. But I am giving it explicit commands that I know it will understand (if it understands my accent, obvs), as opposed to constructing a vague human-language sentence and hoping that an LLM will work it out. Which means it feels like something I can actually become fast with, and build up muscle memory for.
Also that it's completely offline, so I can actually run it on a work computer without my security folks freaking out.
You can customize Aqua using custom instructions, similar to ChatGPT custom instructions, and get some Talon functionality from it:
In my own, I have:
1. Break paragraphs after three or four sentences.
2. Don't start a sentence with "and".
3. Use lowercase in Slack and iMessage.
4. Here are some common terminal commands...
Users have different preferences for the text format they input into different apps. Aqua is able to pick up on these explicit and implicit preferences across apps – but no "open XYZ app" commands, yeah
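Mechanically, this kind of setup usually amounts to splicing the user's instructions plus the active app into the prompt for the formatting model. A generic sketch of the pattern (not our exact implementation):

    # The user's custom instructions and app context get injected into
    # the prompt that the formatting model sees.
    CUSTOM_INSTRUCTIONS = """\
    1. Break paragraphs after three or four sentences.
    2. Don't start a sentence with "and".
    3. Use lowercase in Slack and iMessage.
    """

    def build_prompt(raw_transcript: str, active_app: str) -> str:
        return (
            "Format this dictated text for the app the user is typing into.\n"
            f"Active app: {active_app}\n"
            f"User preferences:\n{CUSTOM_INSTRUCTIONS}\n"
            f"Transcript: {raw_transcript}"
        )

    print(build_prompt("hey can you review my pr when you get a chance", "Slack"))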
A nice open-source alternative is VoiceInk, check it out: https://github.com/Beingpax/VoiceInk
do you also plan to open-source part of your platform?
This is obviously a lie. If this was true, all the inference provider companies would go to zero. I support open-source as much as the next guy here, but it's obvious that the local version will be slower or break more often. Like, come on guys. Be real.
To illustrate this: M4 Max chips do 38 TOPS in FP8. An NVIDIA H100 does around 4,000 TOPS, roughly a 100x gap.
Prakash if you're going to bot our replies, at least make it believable.
I have both apps open. The STT seems to be faster with VoiceInk. Like it is instant. I can send you a video if you want.
I am sorry. I did not want to make your product look bad. You are right that you still need to offload the LLM part to OpenRouter and the like if you want that to be fast too. However, having the ability to switch AI on/off on demand, context-aware, with custom prompts is perfect. It can use Ollama too. Yes, this will be much slower, but it's local. Best of both worlds. No subscription, even if you use cloud AI.
I'm skeptical of the claim that local is literally faster, but it's not impossible in the way you suggest.
For me, with a small (but very accurate) Whisper model, it can be even faster than AquaVoice, and it works locally (which is really important)
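If anyone wants to try the local route, the faster-whisper package makes it a few lines (pick whatever model size fits your machine; "base.en" is a guess that's near-instant on recent laptops):

    # Local transcription with a small Whisper model via faster-whisper
    # (pip install faster-whisper). int8 keeps it fast on CPU.
    from faster_whisper import WhisperModel

    model = WhisperModel("base.en", device="cpu", compute_type="int8")
    segments, info = model.transcribe("dictation.wav", beam_size=5)
    print("".join(seg.text for seg in segments))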
I wouldn't feel comfortable if someone were looking over my shoulder while I'm typing at a coffee shop.
I am not your customer.
Aqua Voice:

"What is the first recorded eclipse in human history? I'm not asking when the first one occurred, but the first written record we have of an eclipse."

Windows Voice Typing (v11 24H2, Dell XPS 13 9340):

"What is the first recorded eclipse in human history i'm not asking 1 like the first occurred but the first ridden record we have of an eclipse"

Windows' mistakes were:
- "1" should be "when"
- "ridden" should be "written"
- No punctuation
My experience with it has been overall positive but mixed. I enjoy using it for dictation, but I found that issuing editing commands and having them recognized/executed often took a lot longer than making an edit myself (which I can't do while in dictation mode).
But as a paying customer, seeing you go in this direction is somewhat sad/frustrating. You're abandoning the product I use, and you're saying that if I want to see my platform supported, I or someone from the community has to provide it - for a fully proprietary paid application.
I understand that I'm a minority user, but it's a bit disappointing to read this.
We do plan to support Linux. This was probably a little bit of a blind spot for us - not realizing that anyone running a Linux desktop doesn't even have a system voice tool to fall back on.
btw, grats!
We're faster, more accurate, and have a streaming option. Aqua can go from key-up to paste in as little as 450ms. Flow was closer to 1000 in our tests.
Overall, you'll notice we make a few more tweaks to the output than Wisprflow.
For example, Aqua + Cursor is very powerful - we syntax-highlight your transcript. The easiest way to see this is to use streaming mode (double press Fn) + deep context + Cursor and try asking it to change something.
This also works in other "context rich" environments.
It has been a godsend in terms of increasing my productivity because I no longer have to type. I think your product's accuracy and latency shortening just make this even better. I often use it and then find out, "Hey, I need to make some changes," and I need to re-edit some of the stuff, which reduces the WPM productivity amount. So I think accuracy is definitely key here. Key metric to differentiate a product.
I am pushing this to other colleagues to get them to adopt. One challenge people are saying is that. One is that some people may not be as organized (you know, they might be a lot more organizationally structured in their mind). So for them, they're having trouble - they'd like to write things out, and by the time if things go out of their mouth, you know it's already formulated logical thought. Whereas you know people like me are a lot more verbal vomit type of person. For me it's huge because I say a lot of um like in all the other things I just dump stuff out and then organize it later.
Whereas other people organize stuff in their brain and then dump the information out. So people who do a lot of coordination and just you know so I feel like this could be two different segments to take into account.
Another one that's been fantastic is that we have multilingual colleagues who are speaking in Mandarin or something else and then they speak it and then ask Flow to be sent to translate it to a different language. That part I think has been fantastic.
I think the ability to edit what you wrote with AI is going to be the next key feature. Providing the context in the window is all within the conversation right? For example, you just ask after because what you write out is not the final and you need to do a lot of editing and formatting. Sometimes when you say too much stuff, it's just like a huge jumble paragraph with a lot of fluff words. Make it clear, concise, trim non-effective words. I think those are a key feature because it's not about your productivity, it's about other people being able to ingest your information efficiently. At least that's what I look at from a managerial perspective.
To give you an example, everything I laid out above came from dictation. You can see how this is inefficient. There's a lot of inefficiencies here.
Another feature that would be great is being able to have a conversation with an AI model first, refine the output iteratively until you're ready, and then pipe all that over. The ability to have that chat, or to do all of this through voice, would be very good.