Pure C, CPU-only inference with Mistral Voxtral Realtime 4B speech to text model (github.com)

110 points by Curiositry 7 hours ago | 5 comments

written-beyond 18 minutes ago
Funny, this and the Rust runtime implementation are neck and neck on the frontpage right now.
Cool project!
sgt 30 minutes ago
I'm very interested in speech to text, particularly for tricky dialects and specialized terminology, but I'm still unsure where best to start in order to train models on a huge database of voice samples I own.
Any ideas from the HN crowd currently involved in speech-to-text models?
Curiositry 5 hours ago
This was a breeze to install on Linux. However, I haven't managed to get realtime transcription working yet, à la Whisper.cpp stream or Moonshine.
--from-mic only supports Mac. I'm able to capture audio with ffmpeg, but adapting the ffmpeg example to use mic capture hasn't worked yet:
ffmpeg -f pulse -channels 1 -i 1 -f s16le - 2>/dev/null | ./voxtral -d voxtral-model --stdin
It's possible my system is simply under spec for the default model.
I'd like to be able to use this with the voxtral-q4.gguf quantized model from here: https://huggingface.co/TrevorJS/voxtral-mini-realtime-gguf
- jwrallie 3 hours ago
  I am interested in a way to capture audio not only from the mic but also from one of the monitor ports, so you could pipe the audio you are hearing from the web directly into one of these solutions for real-time transcription. Did anyone manage to do that?
  I can, for example, capture that audio with Audacity or OBS Studio and transcribe it later, so it should be possible to do it in real time too, assuming my machine can keep up.
  - bebna an hour ago
    Change -i 1 to -i default or to one of your monitor sources; you can look them up with pactl list short sources.
    https://trac.ffmpeg.org/wiki/Capture/PulseAudio
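A rough sketch of the suggestion above, for anyone who wants to try it. `pactl list short sources` prints tab-separated columns with the source name in the second one, and PulseAudio monitor sources have names ending in `.monitor`. The helper name `monitor_sources` and the `MONITOR_NAME` placeholder are made up for illustration, and the `-ar 16000 -ac 1` output options are an assumption about what `--stdin` expects, not something confirmed in this thread.

```shell
# Hypothetical helper: read `pactl list short sources` output on stdin and
# print only the names of monitor sources (second tab-separated column,
# names ending in ".monitor").
monitor_sources() {
    awk -F'\t' '$2 ~ /\.monitor$/ { print $2 }'
}

# Usage (not run here; needs PulseAudio, ffmpeg and voxtral installed):
#   pactl list short sources | monitor_sources
#   ffmpeg -f pulse -i MONITOR_NAME -ac 1 -ar 16000 -f s16le - 2>/dev/null \
#       | ./voxtral -d voxtral-model --stdin
# MONITOR_NAME is a placeholder for one of the names printed by the first
# command. The 16 kHz mono output format is an assumption.
```

Using `-i default` instead of a monitor name would capture from the default input, typically the mic.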
- yjftsjthsd-h 3 hours ago
  Does it work if you use ffmpeg to feed it audio from a file? I personally would first try file->ffmpeg->voxtral, then mic->ffmpeg->file, and only then try to glue together mic->ffmpeg->voxtral.
  (But take this with a grain of salt; I haven't tried it yet.)
  - Curiositry an hour ago
    Recording audio with ffmpeg and transcribing a file that's piped from ffmpeg both work.
    Given that it took 19.64 minutes to transcribe the 11-second sample WAV, it's possible I just didn't wait long enough :)
    yjftsjthsd-h 4 minutes ago
    Ah. In that case... Yeah. Is it using GPU, and does the whole model fit in your (V)RAM?
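To put the timing reported above in perspective: 19.64 minutes of processing for 11 seconds of audio is a real-time factor of roughly 107, meaning the machine would have to be about a hundred times faster before live transcription could keep up. A quick check of the arithmetic, where `rtf` is a made-up helper name and not part of voxtral:

```shell
# Hypothetical helper: real-time factor = processing time / audio duration.
# $1 = processing time in minutes, $2 = audio length in seconds.
rtf() {
    awk -v m="$1" -v s="$2" 'BEGIN { printf "%.1f\n", (m * 60) / s }'
}

rtf 19.64 11   # prints 107.1
```

A factor at or below 1.0 is what streaming from a mic or monitor source would need.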