The latency numbers are what stood out to me though. ~70ms time-to-first-frame is genuinely impressive for an interactive loop. In real conversations, responsiveness dominates perceived realism way more than visual fidelity, so that correlation result makes intuitive sense.
Curious how robust the audio-to-motion mapping is under messy real-world input (overlapping speech, accents, background noise, etc.). Does the flow-matching variant help mostly with stability during training, or also temporal consistency during inference?
The audio-to-motion model is fairly robust to noisy TTS and differing languages/accents. It doesn't take the raw audio as input: we first embed the audio using a pretrained wav2vec-style embedder trained on millions of audio samples.
That said, we haven't properly evaluated across multiple languages, and we have heard from customers that lip-sync isn't always as good in non-English. For Cara 4 we're training on more diverse data, which will hopefully close this gap.
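For anyone curious what that embedding step might look like, here's a minimal sketch using HuggingFace's pretrained Wav2Vec2. The "facebook/wav2vec2-base-960h" checkpoint and the frame-level interface to a downstream motion model are my assumptions, not necessarily what Cara actually uses:

    # Sketch: wav2vec-style audio embedding ahead of an audio-to-motion model.
    # Checkpoint choice is a stand-in; any wav2vec-family embedder would fit here.
    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

    def embed_audio(waveform_16khz: torch.Tensor) -> torch.Tensor:
        # Map a mono 16 kHz waveform to frame-level embeddings (~50 frames/sec)
        # that a downstream audio-to-motion model could consume.
        inputs = extractor(waveform_16khz.numpy(), sampling_rate=16000, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # shape (1, T_frames, 768)
        return hidden.squeeze(0)

The appeal of this design is that the pretrained embedder, rather than the motion model, absorbs most of the acoustic variation (noise, speaker, accent), which is consistent with the robustness described above.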
Most off-the-shelf solutions and existing platforms skew heavily towards the standard HTTP web-service world. However, the bulk of our interactions happen over WebRTC in long-running sessions, where the tooling for in-depth metrics and monitoring is much less mature and less well documented.
Currently we're using InfluxDB, Prometheus, Grafana and some hand-rolled monitoring code alongside the stats that WebRTC exposes itself. I'd be interested to hear how anyone out there is monitoring conversational flows and WebRTC traffic.
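For comparison, here's a rough sketch of one way to export per-session WebRTC stats to Prometheus from a Python media backend built on aiortc. The metric name, labels and the generic "scrape every numeric field" approach are illustrative assumptions, not a description of their production setup:

    # Sketch: periodically pull aiortc getStats() and expose it as Prometheus gauges.
    import asyncio
    from prometheus_client import Gauge, start_http_server

    # One gauge keyed by session, stat type (e.g. "outbound-rtp") and field name.
    WEBRTC_STAT = Gauge(
        "webrtc_session_stat",
        "Numeric WebRTC stats per long-running session",
        ["session_id", "stat_type", "field"],
    )

    async def export_stats(pc, session_id, interval=5.0):
        # pc is an aiortc RTCPeerConnection; getStats() mirrors the browser API.
        while pc.connectionState not in ("closed", "failed"):
            report = await pc.getStats()
            for stat in report.values():
                # Export whatever numeric fields the stats objects expose
                # (packets, bytes, jitter, round-trip time, ...).
                for field, value in vars(stat).items():
                    if isinstance(value, (int, float)):
                        WEBRTC_STAT.labels(session_id, stat.type, field).set(value)
            await asyncio.sleep(interval)

    # start_http_server(9091)                 # expose /metrics for Prometheus
    # asyncio.ensure_future(export_stats(pc, "session-123"))

Grafana can then plot jitter, packet loss and round-trip time per session from the scraped gauges, which covers the transport side; conversational-flow metrics (turn latency, interruptions) would still need the hand-rolled instrumentation mentioned above.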
Personally I like using LLMs for getting information (not chat) or solving problems, and I like that it's text: I can read it quicker than a normal conversation and don't need to look for facial cues while ingesting the information (am I autistic?). But I might be in the minority...
Some people might really find this useful.