MacWhisper [0] (the app I settled on) is conspicuously missing from your benchmarks [1]. How does it compare?
For that benchmarking table, you can use Whisper Large V3 as a stand-in for MacWhisper and Superwhisper accuracy.
Aqua looks good and I will be testing it, but I do like that with superwhisper nothing leaves my computer unless I add AI integrations.
Side-comment of something this made me think of (again): tech builds too much for tech. I've lived in the Bay before, so I know why this happens. When you're there, everyone around you is in tech, your girlfriend is in tech, you go to parties and everyone invariably ends up talking about work, which is tech. Your frustrations are with tech tools and so are your peers', so you're constantly thinking about tech solutions applicable to tech's problems.
This seems very much marketed to SF people doing SF things ("Cursor, Gmail, Slack, even your terminal"). I wonder how much effort has gone into making this work with code editors or the terminal, even though I doubt this would be a big use case for this software if it ever became generally popular. I'd imagine the market here is much larger in education, journalism, film, accessibility, even government. Those are much more exciting demos.
I share the same sentiment. I remember thinking in college how annoying it was that I was reading low-resolution, marked-up, skewed, b&w scans of a book using Adobe Acrobat while CS concentrators were doing everything in VS Code (then brand new).
but we do think voice is actually great with Cursor. It’s also really useful in the terminal for certain things. Checking out or creating branches, for example.
"we also collect and process your voice inputs [..] We leverage this data for improvements and development [..] Sharing of your information [..] service providers [..] OpenAI" https://withaqua.com/privacy
No mention of privacy (or on-prem) - so I assume it's 100% cloud.
Non-starter for me. Accuracy is important, but privacy is more so.
Hopefully a service with these capabilities will emerge where the first step has the user complete a brief training session, sends that to the cloud to tailor the recognition parameters to their voice and mannerisms... and then loads the result locally.
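Roughly, the flow I'm imagining would look something like this (the endpoint and helper names are made up purely to illustrate the architecture, nothing here is a real service):

    # Hypothetical "personalize in the cloud, run locally" flow.
    import requests

    ENROLL_URL = "https://api.example-stt.com/v1/enroll"  # made-up endpoint

    def enroll_and_fetch_profile(sample_wav: str, profile_path: str) -> None:
        """One-time step: upload a short training recording, get back a
        small speaker-adaptation profile to use with a local model."""
        with open(sample_wav, "rb") as f:
            resp = requests.post(ENROLL_URL, files={"audio": f})
        resp.raise_for_status()
        with open(profile_path, "wb") as out:
            out.write(resp.content)

    # From then on, recognition could run fully offline, e.g.:
    #   model = load_local_model("small.en", adapter=profile_path)  # hypothetical
    #   print(model.transcribe("dictation.wav"))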
given that we need the cloud, we offer zero data retention -- you can see this in the app. your concern is as much about ux and communications as it is privacy
And self-hosting real-time streaming LLMs will probably also come out at around 50 cents per hour. Arguing for a $120/month price for power users is probably going to be very difficult (at 50 cents an hour, $120 buys 240 hours of usage a month). Especially so if there are free open-source alternatives.
I’d say local is necessary for a delightful product experience, and the added bonus is that it ticks the privacy box
I do wish there was a mobile app though (or maybe an iOS keyboard). It would also be nice to be able to have a separate hotkey you can set up to send the output to a specific app (instead of just the active one).
Things I've learned are:
1. It works better if you're connected by Ethernet than by Wi-Fi.
2. It needs to have a longer recognition history because sometimes you hit the wrong key to end a recognition session, and it loses everything.
3. Besides the longer history, a debugging mode that records all the characters sent to the dictation box would be useful. Sometimes, I see one set of words, blink, and then it's replaced with a new recognition result. Capturing that would be useful for describing what went wrong.
4. There should be a way to tell us when a new version is running. Occasionally, I've run into problems where I'm getting errors, and I can't tell if it's my speaking, my audio chain, my computer, the network, or the app.
5. Grammarly is a great add-on because it helps me correct mis-speakings and odd little errors, like too many spaces caused by starting and stopping recognition.
When Dragon Systems went through bankruptcy court, a public benefit corporation bid for the core technology because it recognized that Dragon was a critical tool for people with disabilities to function in a digital world.
In my opinion, Aqua has reached a similar status as an essential tool. Well, it doesn't fully replace Dragon for those who need command and control (yet). The recognition accuracy and smoothness are so amazing that I can't envision returning to Dragon Systems without much pain. The only thing worse would be going back to a keyboard.
Aqua Guys, don't fuck it up.
But I’ve noticed/learned that I can’t dictate written content. My brain just does not work that way at all — as I write I am constantly pausing to think, to revise, etc and it feels like a completely different part of my brain is engaged. Everything I dictated with Aqua I had to throw away and rewrite.
Has anyone had similar problems, and if so, had any success retraining themselves toward dictation? There are fleeting moments where it truly feels like it would be much faster.
It's very hard, and I wouldn't do it if I didn't have to.
(which is why I'm always perplexed by these apps which allow voice dictation or voice control, but not as a complete accessibility package. I wouldn't be using my voice if my hands worked!)
It's also critically important (and after 3-4 years of this I still regularly fail at it) to actually read what you've written and edit it before sending, because those chunks don't always line up into something that I'd consider acceptably coherent. Even for a one-sentence Slack message.
(also, I have a kiwi accent, and the dictation software I use is not always perfect at getting what I wanted to say on the page)
In my experience, LLMs can be quite forgiving when given some unfinished input and asked to expand or clean it up?
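Something like this works as a quick sketch (using the official openai package; the model name is just an example, and you'd need OPENAI_API_KEY set in the environment):

    # Minimal cleanup pass: give the LLM the raw dictation and ask it
    # to tidy, not invent.
    from openai import OpenAI

    client = OpenAI()

    def clean_up(raw_dictation: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # example model
            messages=[
                {"role": "system", "content": (
                    "Clean up this dictated text: fix grammar and punctuation, "
                    "drop filler words, keep the speaker's meaning. "
                    "Do not add new content.")},
                {"role": "user", "content": raw_dictation},
            ],
        )
        return resp.choices[0].message.content

    print(clean_up("so um the deploy broke because uh we forgot the env var i think"))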
1. like you mentioned, the second I start talking about something, I totally forget where I'm going and have to pause; it's like my thoughts aren't coming to me. Probably some sort of mental feedback loop plus, like you mentioned, a different method of thinking.
2. in the back of my mind, I'm always self-conscious that someone is listening, so it's a privacy / being judged / being overheard feeling which adds a layer of mental feedback.
There also aren't great audio cues for handling on-the-fly editing. I've tried to say "parentheses word parentheses" and it just gets written out. I've tried to say "strike that" and it gets written out. These interfaces are very 'happy path' and don't do a lot of processing (on iOS, I can say "period" and get a '.' (or ?, !), but that's about the extent).
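Even a naive post-processing pass over the transcript would go a long way. A sketch of what I mean (the command vocabulary here is invented for illustration, not what any real app does):

    # Toy post-processor: turn spoken tokens like "period" into
    # punctuation, and make "strike that" erase the current clause
    # instead of being typed out literally.
    import re

    PUNCTUATION = {"period": ".", "comma": ","}

    def apply_voice_commands(transcript: str) -> str:
        tokens = transcript.lower().split()
        out = []
        i = 0
        while i < len(tokens):
            if tokens[i:i + 2] == ["strike", "that"]:
                # drop words back to the last punctuation mark
                while out and out[-1] not in PUNCTUATION.values():
                    out.pop()
                i += 2
            elif tokens[i] in PUNCTUATION:
                out.append(PUNCTUATION[tokens[i]])
                i += 1
            else:
                out.append(tokens[i])
                i += 1
        # attach punctuation to the preceding word
        return re.sub(r"\s+([.,])", r"\1", " ".join(out))

    print(apply_voice_commands("delete the file strike that keep the file period"))
    # -> "keep the file."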
I have had some success with long-form recording sessions which are transcribed afterwards. After getting over the short initial hump, I can brain-dump to the recording, and then trust an app like Voice Notes or Superwhisper to transcribe, and then clean up after.
The main issue I run into there, though, is that I either forget to record something (e.g. a conversation that I want to review later), or there's too much friction: I don't record often enough for launching it to be quick, or I don't even remember to use that workflow.
I get the same feeling with smart home stuff - it was awesome for a while to turn lights on and off with voice, but lately there's the added overhead of "did it hear me? do I need to repeat myself? What's the least amount of words I can say? Why can't I just think something into existence instead? Or have a perfect contextual interface on a physical device?"
1. The models weren't ready.
2. The interactions were often strained. Not every edit/change is easy to articulate with your voice.
If 1 had been our only problem, we might have had a hit. In reality, I think optimizing for model errors allowed us to ignore some fundamental awkwardness in the experience. We've tried to rectify this in v2 by putting less emphasis on streaming for every interaction and less emphasis on commands, replacing them with context.
Hopefully it can become a tool in the toolbox.
Voice is great whenever the limiting factor on thought is typing speed.
I can't find any documentation on how Aqua works, or how it compares, so I'm not sure it's meant to be a replacement / competitor to Talon? What are you configuring? How are you telling it that you like "genz" style in Slack? Can I create custom configurations / macros?
One thing I like about Talon is that it's not magic. Which maybe is not what you're going for. But I am giving it explicit commands that I know it will understand (if it understands my accent, obvs), as opposed to constructing a vague human-language sentence and hoping that an LLM will work it out. Which means it feels like something I can actually become fast with, and build up muscle memory for.
Also that it's completely offline, so I can actually run it on a work computer without my security folks freaking out.
You can customize Aqua using custom instructions, similar to ChatGPT custom instructions, and get some Talon functionality from it:
In my own, I have:
1. Break paragraphs after three or four sentences.
2. Don't start a sentence with "and".
3. Use lowercase in Slack and iMessage.
4. Here are some common terminal commands...
Users have different preferences for the text format they input into different apps. Aqua is able to pick up on these explicit and implicit preferences across apps – but no "open XYZ app" commands, yeah
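Mechanically, this kind of setup usually amounts to splicing the user's instructions plus the active app into the prompt for the formatting model. A generic sketch of the pattern (not our exact implementation):

    # The user's custom instructions and app context get injected into
    # the prompt that the formatting model sees.
    CUSTOM_INSTRUCTIONS = """\
    1. Break paragraphs after three or four sentences.
    2. Don't start a sentence with "and".
    3. Use lowercase in Slack and iMessage.
    """

    def build_prompt(raw_transcript: str, active_app: str) -> str:
        return (
            "Format this dictated text for the app the user is typing into.\n"
            f"Active app: {active_app}\n"
            f"User preferences:\n{CUSTOM_INSTRUCTIONS}\n"
            f"Transcript: {raw_transcript}"
        )

    print(build_prompt("hey can you review my pr when you get a chance", "Slack"))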
A nice open-source alternative is VoiceInk, check it out: https://github.com/Beingpax/VoiceInk
do you also plan to open-source part of your platform?
This is obviously a lie. If this was true, all the inference provider companies would go to zero. I support open-source as much as the next guy here, but it's obvious that the local version will be slower or break more often. Like, come on guys. Be real.
To illustrate this: M4 Max chips do 38 TOPS in FP8. An NVIDIA H100 does around 4,000 TOPS, roughly a 100x gap.
Prakash if you're going to bot our replies, at least make it believable.
I have both apps open. The STT seems to be faster with VoiceInk. Like it is instant. I can send you a video if you want.
I am sorry. I did not want to make your product look bad. You are right that you still need to offload the LLM part to OpenRouter and the like if you want that to be fast too. However, having the ability to switch AI on/off on demand, context-aware, with custom prompts is perfect. It can use Ollama too. Yes, this will be much slower, but it's local. Best of both worlds. No subscription, even if you use cloud AI.
I'm skeptical of the claim that local is literally faster, but it's not impossible in the way you suggest.
For me, with a small (but very accurate) Whisper model, it can be even faster than AquaVoice, and it works locally (which is really important)
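If anyone wants to try the local route, the faster-whisper package makes it a few lines (pick whatever model size fits your machine; "base.en" is a guess that's near-instant on recent laptops):

    # Local transcription with a small Whisper model via faster-whisper
    # (pip install faster-whisper). int8 keeps it fast on CPU.
    from faster_whisper import WhisperModel

    model = WhisperModel("base.en", device="cpu", compute_type="int8")
    segments, info = model.transcribe("dictation.wav", beam_size=5)
    print("".join(seg.text for seg in segments))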
I wouldn't feel comfortable if someone were looking over my shoulder while I'm typing at a coffee shop.
I am not your customer.
Aqua Voice:

"What is the first recorded eclipse in human history? I'm not asking when the first one occurred, but the first written record we have of an eclipse."

Windows Voice Typing (v11 24H2, Dell XPS 13 9340):

"What is the first recorded eclipse in human history i'm not asking 1 like the first occurred but the first ridden record we have of an eclipse"

Windows' mistakes were:
- "1" should be "when"
- "ridden" should be "written"
- No punctuation
My experience with it has been overall positive but mixed. I enjoy using it for dictation, but I found that issuing editing commands and having them recognized/executed often took a lot longer than making an edit myself (which I can't do while in dictation mode).
But as a paying customer, seeing you go in this direction is somewhat sad/frustrating. You're abandoning the product I use, and you're saying that if I want to see my platform supported, I or someone from the community has to provide it - for a fully proprietary paid application.
I understand that I'm a minority user, but it's a bit disappointing to read this.
We do plan to support Linux. This was probably a little bit of a blind spot for us - not realizing that anyone running a Linux desktop doesn't even have a system voice tool to fall back on.
btw, grats!
We're faster, more accurate, and have a streaming option. Aqua can go from key-up to paste in as little as 450ms. Flow was closer to 1000 in our tests.
Overall, you'll notice we make a few more tweaks to the output than Wisprflow.
For example, Aqua + Cursor is very powerful - we syntax-highlight your transcript. The easiest way to see this is to use streaming mode (double press Fn) + deep context + Cursor and try asking it to change something.
This also works in other "context rich" environments.
It has been a godsend in terms of increasing my productivity because I no longer have to type. I think your product's accuracy and latency shortening just make this even better. I often use it and then find out, "Hey, I need to make some changes," and I need to re-edit some of the stuff, which reduces the WPM productivity amount. So I think accuracy is definitely key here. Key metric to differentiate a product.
I am pushing this to other colleagues to get them to adopt. One challenge people are saying is that. One is that some people may not be as organized (you know, they might be a lot more organizationally structured in their mind). So for them, they're having trouble - they'd like to write things out, and by the time if things go out of their mouth, you know it's already formulated logical thought. Whereas you know people like me are a lot more verbal vomit type of person. For me it's huge because I say a lot of um like in all the other things I just dump stuff out and then organize it later.
Whereas other people organize stuff in their brain and then dump the information out. So people who do a lot of coordination and just you know so I feel like this could be two different segments to take into account.
Another one that's been fantastic is that we have multilingual colleagues who are speaking in Mandarin or something else and then they speak it and then ask Flow to be sent to translate it to a different language. That part I think has been fantastic.
I think the ability to edit what you wrote with AI is going to be the next key feature. Providing the context in the window is all within the conversation right? For example, you just ask after because what you write out is not the final and you need to do a lot of editing and formatting. Sometimes when you say too much stuff, it's just like a huge jumble paragraph with a lot of fluff words. Make it clear, concise, trim non-effective words. I think those are a key feature because it's not about your productivity, it's about other people being able to ingest your information efficiently. At least that's what I look at from a managerial perspective.
To give you an example, everything I laid out above came from dictation. You can see how this is inefficient. There's a lot of inefficiencies here.
Another feature that would be great is being able to have a conversation with an AI model first, refine the output iteratively until you're ready, and then pipe all that over. The ability to have that chat, or to do all of this through voice, would be very good.