4 points by ipotapov 8 hours ago | 1 comment
  • ipotapov 8 hours ago
    I've been building this for a few months now, and it's grown into a complete on-device audio pipeline for Apple Silicon:

    ASR (Qwen3) → TTS (Qwen3 + CosyVoice, 10 languages) → Speech-to-Speech (PersonaPlex 7B, full-duplex) → Speaker Diarization (pyannote + WeSpeaker) → Voice Activity Detection (Silero, real-time streaming) → Forced Alignment (word-level timestamps)

    No Python, no server, no CoreML — pure Swift through MLX. Models download automatically from HuggingFace on first run. The whole diarization stack is ~32 MB.

    Everything is protocol-based and composable — VAD gates ASR, diarization feeds into transcription, embeddings enable speaker verification. Mix and match.
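    To illustrate the "VAD gates ASR" composition pattern, here's a minimal, self-contained Swift sketch. The protocol and type names (VoiceActivityDetector, SpeechRecognizer, GatedTranscriber, EnergyVAD, StubASR) are hypothetical simplifications for this example, not the library's actual API:

    ```swift
    import Foundation

    // Hypothetical protocol for a VAD stage: decides per-frame if speech is present.
    protocol VoiceActivityDetector {
        func isSpeech(_ frame: [Float]) -> Bool
    }

    // Hypothetical protocol for an ASR stage: transcribes a batch of audio frames.
    protocol SpeechRecognizer {
        func transcribe(_ frames: [[Float]]) -> String
    }

    // Composition: the VAD gates the ASR, so only frames flagged as
    // speech ever reach the (more expensive) recognizer.
    struct GatedTranscriber {
        let vad: VoiceActivityDetector
        let asr: SpeechRecognizer

        func process(_ frames: [[Float]]) -> String {
            let speechFrames = frames.filter { vad.isSpeech($0) }
            return asr.transcribe(speechFrames)
        }
    }

    // Toy energy-threshold VAD, for illustration only (the real stack uses Silero).
    struct EnergyVAD: VoiceActivityDetector {
        let threshold: Float
        func isSpeech(_ frame: [Float]) -> Bool {
            let energy = frame.map { $0 * $0 }.reduce(0, +) / Float(frame.count)
            return energy > threshold
        }
    }

    // Stub recognizer that just reports how many frames it was given.
    struct StubASR: SpeechRecognizer {
        func transcribe(_ frames: [[Float]]) -> String {
            "\(frames.count) speech frames"
        }
    }
    ```

    Because each stage is just a protocol conformance, you can swap the toy EnergyVAD for a streaming model-backed detector without touching the downstream code, which is the "mix and match" point above.
    
    
    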

    Repo: github.com/ivan-digital/qwen3-asr-swift (Apache 2.0)

    Blog post with architecture details: blog.ivan.digital

    There's a lot of surface area here, and contributions are very welcome — new model ports, iOS integration, performance work, or just filing issues. If you've been wanting to do anything with audio or MLX in Swift, come build with us.