I'm not sure turn detection can really be solved without a dedicated push-to-talk button, like on a walkie-talkie. I've often used the Google Translate app, and the problem is that when you're speaking a longer sentence you often stop or slow down a little to gather your thoughts before continuing (especially if you're not a native speaker). For this reason I avoid conversation mode in apps like Google Translate, and in the Perplexity app I prefer the push-to-talk mode over the new one.
I think this could be solved, but we would need not only low-latency turn detection but also low-latency interruption detection, plus a very fast, low-latency LLM on device. And when an interruption happens, we need good recovery, so the system knows we are continuing the last sentence instead of discarding the previous audio and starting over.
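As a rough illustration of the recovery idea (all names here are hypothetical, not from any particular framework): keep the partial transcript around when an interruption fires, and only clear it once the turn is actually finalized.

```python
from dataclasses import dataclass, field

@dataclass
class UtteranceState:
    """Accumulates a user's utterance across pauses and interruptions."""
    segments: list[str] = field(default_factory=list)

    def on_transcript(self, text: str) -> None:
        # Append each STT result instead of replacing the previous one.
        self.segments.append(text)

    def on_interruption(self) -> None:
        # The user interrupted the assistant: keep what we already heard,
        # so their continuation is appended to the same utterance.
        pass

    def on_turn_end(self) -> str:
        # Only now hand the full utterance to the LLM and reset state.
        utterance = " ".join(self.segments)
        self.segments.clear()
        return utterance
```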
Lots of things can also be improved on the I/O latency side: using a low-latency audio API, very short audio buffers, a dedicated audio category and mode (on iOS), using wired headsets instead of the built-in speaker, and turning off system processing such as the iPhone's audio boosting or polar-pattern handling. And streaming mode for every stage: STT, transport (when using a remote LLM), and TTS. I'm not sure TTS can run in a truly streaming mode; I think most of the time the text is split by sentence.
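On the sentence-splitting point, here is a minimal sketch of the usual workaround: buffer the LLM's token stream and flush a chunk to TTS whenever a sentence boundary appears. The `synthesize` callback is a stand-in for whatever TTS client is being used.

```python
import re
from typing import Callable, Iterable

SENTENCE_END = re.compile(r"[.!?]\s")

def stream_llm_to_tts(tokens: Iterable[str], synthesize: Callable[[str], None]) -> None:
    """Forward LLM output to TTS sentence by sentence instead of waiting
    for the full response, which hides most of the TTS latency."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = SENTENCE_END.search(buffer)
        while match:
            sentence, buffer = buffer[: match.end()].strip(), buffer[match.end():]
            synthesize(sentence)
            match = SENTENCE_END.search(buffer)
    if buffer.strip():
        synthesize(buffer.strip())  # flush whatever is left at the end
```

In practice you'd wire `tokens` to the LLM's streaming response and `synthesize` to the TTS request.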
I think push-to-talk is a good solution if it's well designed: a big button placed where it's easily reached with your thumb, integration with the iPhone Action Button, haptic feedback, using the Apple Watch as a big push button, etc.
- 100ms inference using CoreML: https://x.com/maxxrubin_/status/1897864136698347857
- An LSTM model (1/7th the size) trained on a subset of the data: https://github.com/pipecat-ai/smart-turn/issues/1
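For anyone curious what a much smaller recurrent model could look like, here's a rough sketch of a frame-level LSTM classifier over audio features. The sizes are made up for illustration and are not taken from the linked issue.

```python
import torch
import torch.nn as nn

class TinyTurnLSTM(nn.Module):
    """Toy end-of-turn classifier: consumes a sequence of acoustic feature
    frames and predicts whether the speaker has finished their turn."""

    def __init__(self, feature_dim: int = 80, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feature_dim), e.g. log-mel frames
        _, (h_n, _) = self.lstm(features)
        return self.head(h_n[-1]).squeeze(-1)  # one logit per utterance
```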
```python
# Training parameters
"learning_rate": 5e-5,
"num_epochs": 10,
"train_batch_size": 12,
"eval_batch_size": 32,
"warmup_ratio": 0.2,
"weight_decay": 0.05,
# Evaluation parameters
"eval_steps": 50,
"save_steps": 50,
"logging_steps": 5,
# Model architecture parameters
"num_frozen_layers": 20
```
I haven't seen a run do all 10 epochs recently; there's usually an early stop after about 4 epochs. The current dataset size is ~8,000 samples.
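For context, here is roughly how parameters like these would map onto a Hugging Face `Trainer` setup with early stopping. This is a generic sketch, not the project's actual training script, and `model` / `train_dataset` / `eval_dataset` are placeholders defined elsewhere.

```python
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="checkpoints",
    learning_rate=5e-5,
    num_train_epochs=10,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=32,
    warmup_ratio=0.2,
    weight_decay=0.05,
    eval_strategy="steps",          # "evaluation_strategy" on older transformers releases
    eval_steps=50,
    save_steps=50,
    logging_steps=5,
    load_best_model_at_end=True,    # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,                    # e.g. an audio classifier with frozen encoder layers
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```

With `eval_steps=50` and a small dataset, an early stop around epoch 4 just means the eval loss stopped improving for a few consecutive evaluations.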
It’s a big deal for detecting human speech when interacting with LLM systems.
"Endpoint detection," "phrase endpointing," and "end of utterance" are terms from the academic literature for this and related problems.
Very few people who are doing "AI Engineering" or even "Machine Learning" today know these terms. In the past, I argued that we should use the existing academic language rather than invent new terms.
But then OpenAI released the Realtime API and called this "turn detection" in their docs. And that was that. It no longer made sense to use any other verbiage.
To help with "what is?" and SEO, perhaps something like "Turn detection (aka [...], end of utterance)"... ?
It seems like, for the obvious use cases, there might need to be some sort of limit on how much this component knows.
I'm particularly interested in architecture variations, approaches to classification-head design, loss functions, etc.
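For concreteness, one common baseline (my sketch, not the project's actual head): mean-pool the encoder's frame embeddings, run them through a small linear head, and train with binary cross-entropy on complete/incomplete labels.

```python
import torch
import torch.nn as nn

class MeanPoolTurnHead(nn.Module):
    """Binary end-of-turn classifier on top of a (partially frozen) audio encoder."""

    def __init__(self, encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = encoder                       # assumed to return frame embeddings
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        frames = self.encoder(audio_features)        # (batch, time, hidden_dim)
        pooled = frames.mean(dim=1)                  # mean-pool over time
        return self.classifier(pooled).squeeze(-1)   # logit: is the turn complete?

# Training would use a plain binary cross-entropy loss on the logits.
loss_fn = nn.BCEWithLogitsLoss()
```

Obvious variations to compare: attention pooling instead of mean pooling, a deeper MLP head, or a class-weighted / focal loss if the complete/incomplete labels are imbalanced.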
We're not doing "speaker diarization" from a single audio track, here. We're streaming the input from each participant.
If there are multiple participants in a session, we still process each stream separately either as it comes in from that user's microphone (locally) or as it arrives over the network (server-side).
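A minimal sketch of that shape (names are illustrative, not from any specific codebase): one detector instance per participant, each fed only that participant's audio chunks.

```python
from collections import defaultdict

class TurnDetector:
    """Placeholder for the per-stream model; real implementation omitted."""
    def process_chunk(self, chunk: bytes) -> bool:
        ...  # would return True when this speaker's turn looks complete

# One independent detector per participant; no diarization is needed because
# each audio stream already belongs to exactly one speaker.
detectors: dict[str, TurnDetector] = defaultdict(TurnDetector)

def on_audio_chunk(participant_id: str, chunk: bytes) -> None:
    if detectors[participant_id].process_chunk(chunk):
        print(f"turn complete for {participant_id}")
```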