195 points by petewarden 8 hours ago | 24 comments
  • dagss 2 minutes ago
    Exciting stuff! You just give this away for free? What are your plans for monetization?

    > hear about what people might build with it
    
    My startup is making software for firefighters to use during missions on tablets, excited to see (when I get the time) if we can use this as a keyboard alternative on the device. Due to the sector we try to rely on the cloud as little as possible and run things either on device or with the possibility of being self-hosted/on-premise. We would need Norwegian though for our current customers.

    Looking through your web

  • Karrot_Kream 6 hours ago
    According to the OpenASR Leaderboard [1], looks like Parakeet V2/V3 and Canary-Qwen (a Qwen finetune) handily beat Moonshine. All 3 models are open, but Parakeet is the smallest of the 3. I use Parakeet V3 with Handy and it works great locally for me.

    [1]: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard

    • reitzensteinm 5 hours ago
      Parakeet V3 has over twice the parameter count of Moonshine Medium (600M vs 245M), so it's not an apples-to-apples comparison.

      I'm actually a little surprised they haven't added model size to that chart.

      • agentifysh 19 minutes ago
        So I'm kinda new to this whole parakeet and moonshine stuff, and I'm able to run parakeet on a low end CPU without issues, so I'm curious as to how much that extra savings on parameters is actually gonna translate.

        Oh and I type this in handy with just my voice and parakeet version three, which is absolutely crazy.

    • theologic 4 hours ago
      By the way, I've been using a Whisper model, specifically WhisperX, to do all my work, and for whatever reason I just simply was not familiar with the Handy app. I've now downloaded and used it, and what a great suggestion. Thank you for putting it here, along with the direct link to the leaderboard.

      I can tell that this is now definitely going to be my go-to model and app on all my clients.

      • jasonjmcghee an hour ago
        I have to ask: I see this Handy app running on Mac, and you hold a key down and then the text doesn't show until seemingly a while later.

        The one built in is much faster, and you only have to toggle it on.

        Are these so much more accurate? I definitely have to correct stuff, but pretty good experience.

        I also use speech to text on my iPhone, which seems to have the same accuracy.

    • tuananh 2 hours ago
      Handy is amazing. Super quality app.
      • agentifysh 21 minutes ago
        It really is. It's kinda ridiculous that it's free.
    • agentifysh 2 hours ago
      hmmm looks like AssemblyAI is still unbeatable here in terms of cost/performance unless I'm mistaken

      edit: holy shit Parakeet is good.... Moonshine is impressive too and it has half the params

      Now if only there was something just as quick as Parakeet v3 for TTS! Then I can talk to codex all day long!!!

      • remuskaos an hour ago
        Parakeet doesn't require a GPU. I'm handily running it on my Ubuntu Linux laptop.
        • agentifysh an hour ago
          you are right, I just downloaded it on Handy and it's working, I can't believe it

          I was using AssemblyAI but this is fast and accurate and offline wtf!

    • syntaxing 4 hours ago
      How much VRAM does Parakeet take for you? For some reason it takes 4GB+ for me using the ONNX version even though it's 600M parameters
    • tomr75 4 hours ago
      why V3 over V2 (assuming English only)?
  • RobotToaster 7 minutes ago
    > Models for other languages are released under the Moonshine Community License, which is a non-commercial license.

    Weird to only release English as open weights.

  • heftykoo 3 hours ago
    Claiming higher accuracy than Whisper Large v3 is a bold opening move. Does your evaluation account for Whisper's notorious hallucination loops during silences (the classic 'Thank you for watching!'), or is this purely based on WER on clean datasets? Also, what's the VRAM footprint for edge deployments? If it fits on a standard 8GB Mac without quantization tricks, this is huge.
  • francislavoie 4 hours ago
    I've helped many Twitch streamers set up https://github.com/royshil/obs-localvocal to plug transcription & translation into their streams, mainly for German audio to English subtitles.

    I'd love a faster and more accurate option than Whisper, but streamers need something off-the-shelf they can install in their pipeline, like an OBS plugin which can just grab the audio from their OBS audio sources.

    I see a couple of obvious problems: this doesn't seem to support translation, which is unfortunate since that's pretty key for this use case. Also, it only supports one language at a time, which is problematic given how frequently streamers code-switch while talking to their chat in different languages or on Discord with their gameplay partners. Maybe such a plugin would be able to detect which language is spoken and route to one or the other model as needed?

  • guerython 3 hours ago
    Nice work. One metric I’d really like to see for streaming use cases is partial stability, not just final WER.

    For voice agents, the painful failure mode is partials getting rewritten every few hundred ms. If you can share it, metrics like median first-token latency, real-time factor, and "% partial tokens revised after 1s / 3s" on noisy far-field audio would make comparisons much more actionable.

    If those numbers look good, this seems very promising for local assistant pipelines.
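
    For anyone who wants to measure this themselves, here is a minimal sketch of one possible reading of "% partial tokens revised after N seconds". It assumes you can log each partial hypothesis as a (timestamp, token list) pair; the function name `revised_fraction` and the sample stream are made up for illustration and are not part of any Moonshine API.

```python
from typing import List, Sequence, Tuple

Partial = Tuple[float, Sequence[str]]  # (seconds since stream start, tokens so far)


def revised_fraction(partials: List[Partial], delay_s: float) -> float:
    """Fraction of token positions still being rewritten at least
    `delay_s` seconds after they were first shown.

    This is one interpretation of "partial stability", not a standard metric.
    """
    first_seen = {}    # position -> time the position first appeared
    last_changed = {}  # position -> last time the token at that position changed
    previous: Sequence[str] = []

    for t, tokens in partials:
        for i, tok in enumerate(tokens):
            if i not in first_seen:
                first_seen[i] = t
                last_changed[i] = t
            elif i >= len(previous) or previous[i] != tok:
                # Token was rewritten (or reappeared after being dropped).
                last_changed[i] = t
        previous = tokens

    if not first_seen:
        return 0.0

    late = sum(
        1 for i in first_seen if last_changed[i] - first_seen[i] >= delay_s
    )
    return late / len(first_seen)


# Hypothetical partials from a streaming recognizer (invented data).
stream = [
    (0.0, ["turn"]),
    (0.5, ["turn", "of"]),
    (1.0, ["turn", "off"]),
    (2.5, ["turn", "off", "the", "light"]),
    (4.0, ["turn", "off", "the", "lights"]),
]

for delay in (1.0, 3.0):
    print(f"revised after {delay}s: {revised_fraction(stream, delay):.0%}")
```

    Positions are compared by index, which is crude: an insertion early in the transcript shifts everything after it and counts as many revisions. An alignment-based comparison would be fairer but is longer to write.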

  • nmstoker 5 hours ago
    Any plans regarding JavaScript support in the browser?

    There was an issue with a demo but it's missing now. I can't recall for sure but I think I got it working locally myself too but then found it broke unexpectedly and I didn't manage to find out why.

  • fareesh 5 hours ago
    Accuracy is often presumed to mean English, which is fine, but "higher" is vague: does it mean higher in English only? Higher in some subset of languages? Which ones?

    The minimum useful data for this stuff is a small table of language | WER for dataset
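
    A rough sketch of how such a table could be produced, assuming you have reference transcripts and model output per language. WER here is the usual word-level edit distance divided by the number of reference words; the helper names and the sample sentences are invented for illustration, and no text normalization (casing, punctuation) is applied.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def wer_by_language(
    samples: Dict[str, List[Tuple[str, str]]]
) -> Dict[str, float]:
    """Corpus-level WER per language: total edits / total reference words."""
    table = {}
    for lang, pairs in samples.items():
        edits = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
        words = sum(len(r.split()) for r, _ in pairs)
        table[lang] = edits / words
    return table


# Invented (reference, hypothesis) pairs, not real benchmark data.
samples = {
    "en": [("a b c d", "a x c d")],
    "no": [("hei du", "hei")],
}

print("language | WER")
for lang, wer in sorted(wer_by_language(samples).items()):
    print(f"{lang:8} | {wer:.2%}")
```

    Reporting the dataset name next to each row matters as much as the number, since WER on clean read speech and on noisy far-field audio are not comparable.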

  • armcat 6 hours ago
    This is awesome, well done guys, I’m gonna try it as my ASR component on the local voice assistant I’ve been building https://github.com/acatovic/ova. The tiny streaming latencies you show look insane
  • ac29 6 hours ago
    No idea why 'sudo pip install --break-system-packages moonshine-voice' is the recommended way to install on raspi?

    The authors do acknowledge this though, and give a slightly too complex way to do this with uv in an example project (FYI, you don't need to source anything if you use uv run)

  • 999900000999 5 hours ago
    Very cool. Any way to run this in WebAssembly? I have a project in mind
  • asqueella 6 hours ago
    For those wondering about the language support, currently English, Arabic, Japanese, Korean, Mandarin, Spanish, Ukrainian, Vietnamese are available (most in Base size = 58M params)
  • starkparker 3 hours ago
    Implemented this to transcribe voice chat in a project and the streaming accuracy in English on this was unusable, even with the medium streaming model.
  • pzo 6 hours ago
    Haven't tested it yet, but I'm wondering how it will behave when talking about a lot of IT jargon and tech acronyms. For that reason I mostly had to run an LLM after STT, but that was slowing down Parakeet inference. Otherwise it sometimes had problems properly detecting terms like CoreML, int8, fp16, half float, ARKit, AVFoundation, ONNX etc.
  • oezi 2 hours ago
    Do you also support timestamps for each detected word, or even down to characters?
  • saltwounds 4 hours ago
    Streaming transcription is crazy fast on an M1. Would be great to use this as a local option versus Wispr Flow.
  • g-mork 6 hours ago
    How does this compare to Parakeet, which runs wonderfully on CPU?
  • sroussey 6 hours ago
    ONNX models for the browser possible?
  • raybb 2 hours ago
    fyi the typepad link in your bio is broken
  • alexnewman 5 hours ago
    If only it did Doric
  • lostmsu 7 hours ago
    How does it compare to Microsoft VibeVoice ASR https://news.ycombinator.com/item?id=46732776 ?
  • cyanydeez 7 hours ago
    No LICENSE no go
    • bangaladore 7 hours ago
      There is a license blurb in the readme.

      > This code, apart from the source in core/third-party, is licensed under the MIT License, see LICENSE in this repository.

      > The English-language models are also released under the MIT License. Models for other languages are released under the Moonshine Community License, which is a non-commercial license.

      > The code in core/third-party is licensed according to the terms of the open source projects it originates from, with details in a LICENSE file in each subfolder.

      • mkl 4 hours ago
        The LICENSE file that refers to is missing. There's one in the python folder, but not for the rest of the code.
    • altruios 7 hours ago
      Reading through readme.md: "License: This code, apart from the source in core/third-party, is licensed under the MIT License, see LICENSE in this repository.

      The English-language models are also released under the MIT License. Models for other languages are released under the Moonshine Community License, which is a non-commercial license.

      The code in core/third-party is licensed according to the terms of the open source projects it originates from, with details in a LICENSE file in each subfolder."

  • aplomb1026 5 hours ago
    [flagged]