4 points by ipotapov 8 hours ago | 1 comment
  • ipotapov 8 hours ago
    I've been building this for a few months now, and it's grown into a complete on-device audio pipeline for Apple Silicon:

    ASR (Qwen3) → TTS (Qwen3 + CosyVoice, 10 languages) → Speech-to-Speech (PersonaPlex 7B, full-duplex) → Speaker Diarization (pyannote + WeSpeaker) → Voice Activity Detection (Silero, real-time streaming) → Forced Alignment (word-level timestamps)

    No Python, no server, no CoreML — pure Swift through MLX. Models download automatically from HuggingFace on first run. The whole diarization stack is ~32 MB.

    Everything is protocol-based and composable — VAD gates ASR, diarization feeds into transcription, embeddings enable speaker verification. Mix and match.
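    To illustrate the "VAD gates ASR" composition pattern, here's a minimal, self-contained Swift sketch. The protocol and type names (VoiceActivityDetector, SpeechRecognizer, GatedTranscriber, EnergyVAD, StubASR) are hypothetical simplifications for this example, not the library's actual API:

    ```swift
    import Foundation

    // Hypothetical protocol for a VAD stage: decides per-frame if speech is present.
    protocol VoiceActivityDetector {
        func isSpeech(_ frame: [Float]) -> Bool
    }

    // Hypothetical protocol for an ASR stage: transcribes a batch of audio frames.
    protocol SpeechRecognizer {
        func transcribe(_ frames: [[Float]]) -> String
    }

    // Composition: the VAD gates the ASR, so only frames flagged as
    // speech ever reach the (more expensive) recognizer.
    struct GatedTranscriber {
        let vad: VoiceActivityDetector
        let asr: SpeechRecognizer

        func process(_ frames: [[Float]]) -> String {
            let speechFrames = frames.filter { vad.isSpeech($0) }
            return asr.transcribe(speechFrames)
        }
    }

    // Toy energy-threshold VAD, for illustration only (the real stack uses Silero).
    struct EnergyVAD: VoiceActivityDetector {
        let threshold: Float
        func isSpeech(_ frame: [Float]) -> Bool {
            let energy = frame.map { $0 * $0 }.reduce(0, +) / Float(frame.count)
            return energy > threshold
        }
    }

    // Stub recognizer that just reports how many frames it was given.
    struct StubASR: SpeechRecognizer {
        func transcribe(_ frames: [[Float]]) -> String {
            "\(frames.count) speech frames"
        }
    }
    ```

    Because each stage is just a protocol conformance, you can swap the toy EnergyVAD for a streaming model-backed detector without touching the downstream code, which is the "mix and match" point above.
    
    
    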

    Repo: github.com/ivan-digital/qwen3-asr-swift (Apache 2.0)

    Blog post with architecture details: blog.ivan.digital

    There's a lot of surface area here, and contributions are very welcome — new model ports, iOS integration, performance work, or just filing issues. If you've been wanting to do anything with audio or MLX in Swift, come build with us.