    I’m building an AI co-host stack for Podium Outpost live audio rooms.

    Each agent joins as a real participant. It listens to room audio, transcribes (ASR), generates a reply (LLM), and speaks back into the room (TTS).

    The hard part isn’t the talking; it’s real-time coordination and latency.

    In a multi-agent room, overlapping speech breaks the experience immediately, so I built a separate Turn Coordinator service that grants lease-based speaking turns: only the agent holding the lease can speak.
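
    To make the lease idea concrete, here is a minimal sketch of the grant logic. The names and the in-memory state are illustrative, not the actual coordinator API:

        // Minimal sketch of lease-based turn granting (illustrative names,
        // in-memory state; the real coordinator is a separate service).
        type Lease = { agentId: string; expiresAt: number };

        let current: Lease | null = null;

        function grantLease(agentId: string, durationMs: number): Lease | null {
          const now = Date.now();
          // An expired lease counts as released, so a crashed agent
          // can't hold the floor forever.
          if (current && current.expiresAt > now) return null; // floor is taken
          current = { agentId, expiresAt: now + durationMs };
          return current;
        }

        function releaseLease(agentId: string): void {
          if (current?.agentId === agentId) current = null;
        }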

    Current system pieces:

    • Real audio in/out via a Playwright-driven Jitsi “browser bot” (48kHz mono, 20ms frames) bridging PCM to/from Node
    • Lease-based turn coordination to prevent overlap
    • Selection logic: name-addressing → round-robin fallback, with optional importance-score auction (see the sketch after this list)
    • Speaking-time enforcement aligned with the frontend (live remaining_time updates + hard user.time_is_up signal; agent force-mutes / won’t start if expired)
    • Latency masking via optional persona “filler” clips while the LLM is generating to reduce perceived dead air
    • Turn metrics logging (e.g., endOfUserSpeechToBotAudioMs)
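
    For the selection step, a minimal sketch (illustrative names, and the optional importance-score auction is left out):

        // Illustrative next-speaker pick: prefer an agent addressed by name
        // in the latest transcript, else round-robin over registered agents.
        function pickNextSpeaker(
          transcript: string,
          agents: string[],
          lastIndex: number,
        ): { agentId: string; nextIndex: number } {
          const lower = transcript.toLowerCase();
          const addressed = agents.find((a) => lower.includes(a.toLowerCase()));
          if (addressed) {
            return { agentId: addressed, nextIndex: agents.indexOf(addressed) };
          }
          const nextIndex = (lastIndex + 1) % agents.length;
          return { agentId: agents[nextIndex], nextIndex };
        }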

    What I’m testing:

    Does explicit time pressure change how AI speakers structure arguments in a live setting?

    Early tests suggest it does. Shorter leases push tighter claims and fewer digressions. The constraint seems to shape the output.
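
    Concretely, the lease budget reaches the model as an explicit prompt constraint. A sketch of the shape (the wording and the words-per-second estimate are my assumptions, not the repo’s exact prompt):

        // Sketch: turn a speaking lease into an explicit constraint the
        // LLM can plan against. The word budget assumes ~150 wpm speech.
        function buildSystemPrompt(persona: string, leaseMs: number): string {
          const seconds = Math.floor(leaseMs / 1000);
          const wordBudget = Math.floor(seconds * 2.5); // ~150 wpm
          return [
            persona,
            `You have ${seconds} seconds of speaking time (about ${wordBudget} words).`,
            `Make one tight claim and support it; no preamble, no digressions.`,
          ].join("\n");
        }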

    I’m especially interested in feedback on:

    • Reducing awkward handoff gaps (end-of-speech → first bot audio), particularly long-tail latency cases (see the sketch after this list)
    • Better speaking-time allocation models beyond round-robin / auction
    • Whether an audience-influenced time-budget mechanic would work for humans (debates, panels, classrooms) and what failure modes you’d expect
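
    On the handoff-gap point, what I watch is the tail of the endOfUserSpeechToBotAudioMs distribution, roughly like this (the percentile helper and sample values are illustrative):

        // Nearest-rank percentiles over logged gap samples (ms). The long
        // tail (p95/p99) is what breaks the conversational feel, not p50.
        function percentile(samples: number[], p: number): number {
          const sorted = [...samples].sort((a, b) => a - b);
          const rank = Math.ceil((p / 100) * sorted.length) - 1;
          return sorted[Math.max(0, Math.min(sorted.length - 1, rank))];
        }

        const gaps = [640, 700, 720, 810, 900, 2400, 3100]; // made-up samples
        console.log({
          p50: percentile(gaps, 50),
          p95: percentile(gaps, 95),
          p99: percentile(gaps, 99),
        });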

    Code + setup: https://github.com/myFiHub/podium-voices

    There’s also a live test Outpost (may not always be active): https://www.podium.myfihub.com/outpost_details/019c170d-2d37...

    If you try it locally, start the coordinator:

    COORDINATOR_PORT=3001 npm run start:coordinator

    Then run two agents with:

    • COORDINATOR_URL
    • distinct AGENT_IDs
    • separate room tokens so they appear as different participants
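
    For example (the start:agent script name and the ROOM_TOKEN variable are assumptions here; check the repo’s package.json and README for the real ones):

    COORDINATOR_URL=http://localhost:3001 AGENT_ID=agent-a ROOM_TOKEN=<token-1> npm run start:agent
    COORDINATOR_URL=http://localhost:3001 AGENT_ID=agent-b ROOM_TOKEN=<token-2> npm run start:agent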

    This is still early and rough. I’m mainly exploring whether governance primitives like turn leases and time budgets meaningfully alter model behavior in live audio environments.

    Curious what this community thinks.