23 points by grayne 4 hours ago | 3 comments
  • iogbole 3 hours ago
    Really interesting architecture choice separating motion from rendering. That feels like the right abstraction boundary if you want identity generalisation without retraining.

    The latency numbers are what stood out to me though. ~70ms time-to-first-frame is genuinely impressive for an interactive loop. In real conversations, responsiveness dominates perceived realism way more than visual fidelity, so that correlation result makes intuitive sense.

    Curious how robust the audio-to-motion mapping is under messy real-world input (overlapping speech, accents, background noise, etc.). Does the flow-matching variant help mostly with stability during training, or also temporal consistency during inference?

    • grayne an hour ago
      Thanks for the question.

      Audio-to-motion is fairly robust to noisy TTS and differing languages/accents. It doesn't use the raw audio as input; we first embed the audio using a pretrained wav2vec-style embedder trained on millions of audio samples.

      That said, we haven't properly evaluated it in multiple languages, and we have heard from customers that lip-sync isn't always as good in non-English. For Cara 4 we're training on more diverse data, which will hopefully close this gap.
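
      For illustration, here's a minimal sketch of the embed-then-drive-motion pattern. It uses the open facebook/wav2vec2-base-960h checkpoint from HuggingFace purely as a stand-in; the production embedder, checkpoint, and motion model aren't public, so treat every name below as illustrative:

      ```python
      import torch
      from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

      # Illustrative stand-in checkpoint, not the embedder described above.
      CHECKPOINT = "facebook/wav2vec2-base-960h"
      extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
      model = Wav2Vec2Model.from_pretrained(CHECKPOINT).eval()

      def embed_audio(waveform, sample_rate=16_000):
          # wav2vec2 expects 16 kHz mono float audio
          inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
          with torch.no_grad():
              hidden = model(**inputs).last_hidden_state  # (1, num_frames, 768), roughly 50 frames/sec
          return hidden

      # A downstream motion model consumes these frame-level embeddings rather than raw
      # samples, which is where most of the robustness to accents and noisy TTS comes from.
      ```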

  • peanut_merchant 4 hours ago
    One of the backend developers at Anam here. One of the hardest parts of developing this has been monitoring and analytics.

    Most off-the-shelf solutions and existing platforms skew heavily towards the normal HTTP web-service world. However, the bulk of our interactions happen over WebRTC in long-running sessions, where the existing solutions for in-depth metrics and monitoring are much less mature and less well documented.

    Currently we're using InfluxDB, Prometheus, Grafana, and some hand-rolled monitoring code alongside the stats that WebRTC offers itself. I'd be interested to know how anyone out there is monitoring conversational flows and WebRTC traffic.
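
    As an illustration of the general pattern (not our actual code), one way to scrape per-session RTP stats into Prometheus from the server side looks roughly like this. It assumes a Python media path on aiortc and the standard prometheus_client exporter; the metric names and labels are made up for the sketch:

    ```python
    import asyncio
    from aiortc import RTCPeerConnection
    from prometheus_client import Gauge, start_http_server

    # Hypothetical metric names and labels, chosen for this sketch only.
    RTT = Gauge("webrtc_session_rtt_seconds", "RTT from remote-inbound-rtp stats", ["session"])
    LOSS = Gauge("webrtc_session_fraction_lost", "Fraction of RTP packets lost", ["session"])

    async def poll_stats(pc: RTCPeerConnection, session_id: str, interval: float = 5.0) -> None:
        # Long-running sessions need periodic scraping rather than per-request middleware.
        while pc.connectionState not in ("closed", "failed"):
            report = await pc.getStats()
            for stats in report.values():
                if stats.type == "remote-inbound-rtp":
                    if stats.roundTripTime is not None:
                        RTT.labels(session=session_id).set(stats.roundTripTime)
                    if stats.fractionLost is not None:
                        LOSS.labels(session=session_id).set(stats.fractionLost)
            await asyncio.sleep(interval)

    # start_http_server(9091)                        # Prometheus scrapes this; Grafana sits on top
    # asyncio.create_task(poll_stats(pc, "abc123"))  # one poller per peer connection
    ```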

  • grayne 3 hours ago
    • 72deluxe an hour ago
      Very clever and quite frightening. Well done.

      Personally I like using LLMs for getting information (not chat) or solving problems, and I like the fact that it's text: I can read it more quickly than a normal conversation, and I don't need to look for facial cues when taking in the information provided (am I autistic?). But I might be in a minority...

      Some people might really find this useful.