This is my voice AI stack:
- ESP32 (Arduino framework) as the client device that interfaces with the voice AI pipeline
- mlx-audio for STT (Whisper) and streaming TTS (`qwen3-tts` / `chatterbox-turbo`)
- mlx-vlm to run vision-language models such as Qwen3.5-9B and Mistral
- mlx-lm to run LLMs such as Qwen3, Llama3.2, Gemma3
- Secure WebSockets (wss) to communicate with a MacBook
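
The ESP32 and the MacBook need an agreed frame format for streaming audio over the WebSocket link. Here is a minimal sketch of one possible JSON envelope; the field names (`type`, `seq`, `sample_rate`, `pcm`) are my own illustrative assumptions, not the repo's actual wire protocol:

```python
import base64
import json

# Hypothetical message envelope for the ESP32 <-> MacBook WebSocket link.
# All field names here are assumptions for illustration, not the repo's format.

def encode_audio_chunk(pcm_bytes: bytes, seq: int, sample_rate: int = 16000) -> str:
    """Wrap a raw PCM chunk in a JSON text frame (binary payload is base64)."""
    return json.dumps({
        "type": "audio_chunk",
        "seq": seq,
        "sample_rate": sample_rate,
        "pcm": base64.b64encode(pcm_bytes).decode("ascii"),
    })

def decode_audio_chunk(frame: str) -> tuple[int, int, bytes]:
    """Inverse of encode_audio_chunk: returns (seq, sample_rate, pcm bytes)."""
    msg = json.loads(frame)
    if msg["type"] != "audio_chunk":
        raise ValueError(f"unexpected frame type: {msg['type']}")
    return msg["seq"], msg["sample_rate"], base64.b64decode(msg["pcm"])

if __name__ == "__main__":
    frame = encode_audio_chunk(b"\x00\x01\x02\x03", seq=7)
    print(decode_audio_chunk(frame))  # (7, 16000, b'\x00\x01\x02\x03')
```

A text/JSON envelope like this is easy to debug from either end; a real implementation might switch to binary WebSocket frames to avoid the base64 overhead on the microcontroller.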
This repo supports inference on Apple Silicon chips (M1/2/3/4/5), and I am planning to add Windows support soon. Would love to hear your thoughts on the project.