I'm not sure turn detection can really be solved without a dedicated push-to-talk button, like on a walkie-talkie. I've often used the Google Translate app, and the problem is that when you're speaking a longer sentence you often stop or slow down a little to gather your thoughts before continuing (especially if you're not a native speaker). For this reason I avoid conversation mode in apps like Google Translate, and in the Perplexity app I prefer the push-to-talk mode over the new one.
I think this could be solved, but we would need not only low-latency turn detection but also low-latency interruption detection, plus a very fast, low-latency LLM on device. And when an interruption happens, we need good recovery, so the system knows we are continuing the last sentence instead of discarding the previous audio and starting over.
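As a rough illustration of the recovery idea (all names here are hypothetical, not from any particular framework): keep the partial transcript around when an interruption fires, and only clear it once the turn is actually finalized.

```python
from dataclasses import dataclass, field

@dataclass
class UtteranceState:
    """Accumulates a user's utterance across pauses and interruptions."""
    segments: list[str] = field(default_factory=list)

    def on_transcript(self, text: str) -> None:
        # Append each STT result instead of replacing the previous one.
        self.segments.append(text)

    def on_interruption(self) -> None:
        # The user interrupted the assistant: keep what we already heard,
        # so their continuation is appended to the same utterance.
        pass

    def on_turn_end(self) -> str:
        # Only now hand the full utterance to the LLM and reset state.
        utterance = " ".join(self.segments)
        self.segments.clear()
        return utterance
```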
Lots of things can also be improved on the I/O latency side: using a low-latency audio API, very short audio buffers, a dedicated audio category and mode (on iOS), using wired headsets instead of the built-in speaker, and turning off system processing such as the iPhone's audio boosting or polar-pattern handling. And streaming mode for every stage: STT, transport (when using a remote LLM), and TTS. I'm not sure TTS can run in a truly streaming mode; I think most of the time the text is split by sentence.
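On the sentence-splitting point, here is a minimal sketch of the usual workaround: buffer the LLM's token stream and flush a chunk to TTS whenever a sentence boundary appears. The `synthesize` callback is a stand-in for whatever TTS client is being used.

```python
import re
from typing import Callable, Iterable

SENTENCE_END = re.compile(r"[.!?]\s")

def stream_llm_to_tts(tokens: Iterable[str], synthesize: Callable[[str], None]) -> None:
    """Forward LLM output to TTS sentence by sentence instead of waiting
    for the full response, which hides most of the TTS latency."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = SENTENCE_END.search(buffer)
        while match:
            sentence, buffer = buffer[: match.end()].strip(), buffer[match.end():]
            synthesize(sentence)
            match = SENTENCE_END.search(buffer)
    if buffer.strip():
        synthesize(buffer.strip())  # flush whatever is left at the end
```

In practice you'd wire `tokens` to the LLM's streaming response and `synthesize` to the TTS request.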
I think push-to-talk is a good solution if it's well designed: a big button placed where it's easily reached with your thumb, integration with the iPhone Action Button, haptic feedback, using the Apple Watch as a big push button, etc.
- 100ms inference using CoreML: https://x.com/maxxrubin_/status/1897864136698347857
- An LSTM model (1/7th the size) trained on a subset of the data: https://github.com/pipecat-ai/smart-turn/issues/1
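For anyone curious what a much smaller recurrent model could look like, here's a rough sketch of a frame-level LSTM classifier over audio features. The sizes are made up for illustration and are not taken from the linked issue.

```python
import torch
import torch.nn as nn

class TinyTurnLSTM(nn.Module):
    """Toy end-of-turn classifier: consumes a sequence of acoustic feature
    frames and predicts whether the speaker has finished their turn."""

    def __init__(self, feature_dim: int = 80, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feature_dim), e.g. log-mel frames
        _, (h_n, _) = self.lstm(features)
        return self.head(h_n[-1]).squeeze(-1)  # one logit per utterance
```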
```python
# Training parameters
"learning_rate": 5e-5,
"num_epochs": 10,
"train_batch_size": 12,
"eval_batch_size": 32,
"warmup_ratio": 0.2,
"weight_decay": 0.05,
# Evaluation parameters
"eval_steps": 50,
"save_steps": 50,
"logging_steps": 5,
# Model architecture parameters
"num_frozen_layers": 20
```
I haven't seen a run do all 10 epochs recently; there's usually an early stop after about 4 epochs. The current dataset size is ~8,000 samples.
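For context, here is roughly how parameters like these would map onto a Hugging Face `Trainer` setup with early stopping. This is a generic sketch, not the project's actual training script, and `model` / `train_dataset` / `eval_dataset` are placeholders defined elsewhere.

```python
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=5e-5,
    num_train_epochs=10,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=32,
    warmup_ratio=0.2,
    weight_decay=0.05,
    eval_strategy="steps",          # "evaluation_strategy" on older transformers releases
    eval_steps=50,
    save_steps=50,
    logging_steps=5,
    load_best_model_at_end=True,    # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,                    # e.g. an audio classifier with frozen encoder layers
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```

With `eval_steps=50` and a small dataset, an early stop around epoch 4 just means the eval loss stopped improving for a few consecutive evaluations.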
It’s a big deal for detecting human speech when interacting with LLM systems.
"Endpoint detection," "phrase endpointing," and "end of utterance" are terms from the academic literature for this and related problems.
Very few people who are doing "AI Engineering" or even "Machine Learning" today know these terms. In the past, I argued that we should use the existing academic language rather than invent new terms.
But then OpenAI released the Realtime API and called this "turn detection" in their docs. And that was that. It no longer made sense to use any other verbiage.
To help with "what is?" and SEO, perhaps something like "Turn detection (aka [...], end of utterance)"... ?
It seems like, for the obvious use cases, there might need to be some sort of limit on how much this component knows.
I'm particularly interested in architecture variations, approaches to classification-head design, loss functions, etc.
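For concreteness, one common baseline (my sketch, not the project's actual head): mean-pool the encoder's frame embeddings, run them through a small linear head, and train with binary cross-entropy on complete/incomplete labels.

```python
import torch
import torch.nn as nn

class MeanPoolTurnHead(nn.Module):
    """Binary end-of-turn classifier on top of a (partially frozen) audio encoder."""

    def __init__(self, encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = encoder                       # assumed to return frame embeddings
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        frames = self.encoder(audio_features)        # (batch, time, hidden_dim)
        pooled = frames.mean(dim=1)                  # mean-pool over time
        return self.classifier(pooled).squeeze(-1)   # logit: is the turn complete?

# Training would use a plain binary cross-entropy loss on the logits.
loss_fn = nn.BCEWithLogitsLoss()
```

Obvious variations to compare: attention pooling instead of mean pooling, a deeper MLP head, or a class-weighted / focal loss if the complete/incomplete labels are imbalanced.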
We're not doing "speaker diarization" from a single audio track, here. We're streaming the input from each participant.
If there are multiple participants in a session, we still process each stream separately either as it comes in from that user's microphone (locally) or as it arrives over the network (server-side).
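A minimal sketch of that shape (names are illustrative, not from any specific codebase): one detector instance per participant, each fed only that participant's audio chunks.

```python
from collections import defaultdict

class TurnDetector:
    """Placeholder for the per-stream model; real implementation omitted."""
    def process_chunk(self, chunk: bytes) -> bool:
        ...  # would return True when this speaker's turn looks complete

# One independent detector per participant; no diarization is needed because
# each audio stream already belongs to exactly one speaker.
detectors: dict[str, TurnDetector] = defaultdict(TurnDetector)

def on_audio_chunk(participant_id: str, chunk: bytes) -> None:
    if detectors[participant_id].process_chunk(chunk):
        print(f"turn complete for {participant_id}")
```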