Streaming was the worst one. Kokoro doesn't expose a streaming interface as far as I could find: you hand it a chunk of text and it gives you back the full audio for that chunk. For a reading app you can't wait for a whole paragraph before playback starts, so the whole streaming layer had to be built on top. I didn't want to pre-process the book and then serve full audio; I wanted it to be interactive.
The basic shape: chunk into sentence-sized windows, render in the background, queue rendered chunks for playback, keep a small pre-render lookahead so playback never starves but the phone isn't speculatively rendering an entire chapter it might throw away on a skip.
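A minimal sketch of that lookahead policy, with every name hypothetical (the real app's queue is more involved): keep a fixed window of chunk indices rendered ahead of the playback cursor, and throw away anything outside the window on a skip.

```swift
import Foundation

// Illustrative lookahead queue: `wanted()` tells the background renderer
// which chunk indices to synthesize next; a seek invalidates speculative
// work outside the new window so a skip never wastes a whole chapter.
final class RenderQueue {
    private let lookahead: Int
    private(set) var rendered: [Int: Data] = [:]  // chunk index -> audio
    private var cursor = 0                        // chunk currently playing

    init(lookahead: Int) { self.lookahead = lookahead }

    // Indices inside the window that still need rendering.
    func wanted() -> [Int] {
        (cursor...(cursor + lookahead)).filter { rendered[$0] == nil }
    }

    func store(index: Int, audio: Data) { rendered[index] = audio }

    // Playback finished a chunk; slide the window forward.
    func advance() { cursor += 1; prune() }

    // User skipped; re-center the window and drop stale renders.
    func seek(to index: Int) { cursor = index; prune() }

    private func prune() {
        rendered = rendered.filter { $0.key >= cursor && $0.key <= cursor + lookahead }
    }
}
```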
Sentence chunking was its own fight. Too long and the model returns null and playback stops. Too short (four or five words at a time) and naturalness suffers, because the model uses context within a sentence to decide intonation; chopped chunks sound like a bad GPS voice. I had to find the goldilocks window where the model is happy and the result still sounds good, and handle long-sentence edge cases by splitting on secondary punctuation and stitching the audio back together without audible seams.
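Roughly what a two-pass chunker like that looks like; the thresholds and the naive sentence splitter are illustrative, not the app's actual values:

```swift
import Foundation

// Pass 1 (illustrative): naive sentence split on terminal punctuation.
// A real splitter would handle abbreviations, ellipses, quotes, etc.
func sentences(_ text: String) -> [String] {
    var out: [String] = []
    var current = ""
    for ch in text {
        current.append(ch)
        if ".!?".contains(ch) {
            let t = current.trimmingCharacters(in: .whitespacesAndNewlines)
            if !t.isEmpty { out.append(t) }
            current = ""
        }
    }
    let t = current.trimmingCharacters(in: .whitespacesAndNewlines)
    if !t.isEmpty { out.append(t) }
    return out
}

// Pass 2: any sentence over `maxLen` gets broken on secondary punctuation
// (commas, semicolons, colons), accumulating pieces so each chunk stays in
// the goldilocks window. Rejoining with ", " flattens ";" and ":" to
// commas, which is close enough for prosody in this sketch.
func chunk(_ text: String, maxLen: Int = 200, minLen: Int = 20) -> [String] {
    var chunks: [String] = []
    for sentence in sentences(text) {
        if sentence.count <= maxLen { chunks.append(sentence); continue }
        var current = ""
        for piece in sentence.split(whereSeparator: { ",;:".contains($0) }) {
            let piece = piece.trimmingCharacters(in: .whitespaces)
            if current.count + piece.count + 2 > maxLen, current.count >= minLen {
                chunks.append(current)
                current = piece
            } else {
                current = current.isEmpty ? piece : current + ", " + piece
            }
        }
        if !current.isEmpty { chunks.append(current) }
    }
    return chunks
}
```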
For battery life there's cruise mode. When the screen is off and the next several sentences are already rendered and cached, the app swaps the whole synthesis/playback pipeline for a much lighter sequential AAC player working through hardware-decoded audio files.
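The hand-off could look something like this sketch; `cachedURLs` stands in for whatever cache-lookup the app actually uses, and the runway length is made up:

```swift
import AVFoundation

// Hypothetical cruise-mode entry: if the next N chunks are already on disk
// as M4A, stop driving the synthesis pipeline and hand the cached files to
// AVQueuePlayer, which plays them through the hardware AAC decoder.
func enterCruiseMode(startingAt index: Int,
                     player: AVQueuePlayer,
                     cachedURLs: (_ from: Int, _ count: Int) -> [URL]?) -> Bool {
    // Need a runway of cached audio before it's worth switching pipelines.
    guard let urls = cachedURLs(index, 8) else { return false }
    player.removeAllItems()
    for url in urls {
        // `after: nil` appends the item at the end of the queue.
        player.insert(AVPlayerItem(url: url), after: nil)
    }
    player.play()
    return true
}
```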
When the phone's on a charger, a background task pre-renders a chapter or two of upcoming audio and writes it to disk as M4A. That way, by the time you're actually reading, cruise mode has a cache to play from and the neural engine never has to wake up for long stretches. The system decides when to actually run the task, so it piggybacks on the phone's usual overnight charging window.
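The charger-gated pre-render pass maps naturally onto a `BGProcessingTaskRequest` with `requiresExternalPower`; the task identifier and the `renderUpcomingChapters` helper here are hypothetical (the identifier would also need to be listed under `BGTaskSchedulerPermittedIdentifiers` in Info.plist):

```swift
import BackgroundTasks

// Hypothetical identifier for the pre-render task.
let prerenderTaskID = "com.example.reader.prerender"

// Ask the system to run the pre-render pass next time we're on power.
// The OS picks the actual moment, e.g. the overnight charging window.
func schedulePrerender() {
    let request = BGProcessingTaskRequest(identifier: prerenderTaskID)
    request.requiresExternalPower = true       // only run on the charger
    request.requiresNetworkConnectivity = false // everything is on-device
    try? BGTaskScheduler.shared.submit(request)
}

// Call once at launch, before the app finishes launching.
func registerPrerender() {
    BGTaskScheduler.shared.register(forTaskWithIdentifier: prerenderTaskID,
                                    using: nil) { task in
        guard let task = task as? BGProcessingTask else { return }
        task.expirationHandler = {
            // Stop synthesis cleanly; finished M4A files stay on disk.
        }
        // renderUpcomingChapters is assumed app code: synthesize the next
        // chapter or two and write M4A files for cruise mode to pick up.
        // renderUpcomingChapters { task.setTaskCompleted(success: $0) }
        schedulePrerender() // re-arm for the next charging session
    }
}
```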
The Neural Engine was a disappointment. I was hoping to get Kokoro onto the ANE for the latency/efficiency win (it already runs quite well on CPU), but it uses ops that CoreML doesn't route to the Neural Engine, so it falls back to GPU/CPU. The weird part: forcing .cpuAndNeuralEngine is actually slower than .cpuAndGPU on this model, probably partitioning cost from unsupported ops bouncing between compute units, but I don't fully understand why. If anyone on CoreML has a principled explanation I'd love to hear it.
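For anyone who wants to reproduce the compute-unit comparison, it's a one-line configuration change at model load; the model name here is a stand-in for the compiled Kokoro bundle:

```swift
import CoreML

// Load the model pinned to a given compute-unit set, so the two
// configurations from the text can be timed against each other.
func loadKokoro(units: MLComputeUnits) throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = units  // .cpuAndGPU vs .cpuAndNeuralEngine
    // Hypothetical location of the compiled model in the app bundle.
    guard let url = Bundle.main.url(forResource: "Kokoro",
                                    withExtension: "mlmodelc") else {
        throw CocoaError(.fileNoSuchFile)
    }
    return try MLModel(contentsOf: url, configuration: config)
}
```

Timing inference under each setting (and inspecting the model in Xcode's Core ML performance report) is what surfaces the unsupported-op partitioning the text describes.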
iPhone 12 mini and older, plus the simulator, are cursed. They seem to run Kokoro successfully, i.e. no error, inference completes, but the result is pure crackling/screeching gibberish audio. Same model, same weights, same code path. KittenTTS runs fine on the exact same hardware AND the Xcode simulator. I still don't know what's going on here; curious if anyone's seen similar.
KittenTTS was easy. Ported it as a fallback for older devices and published a minimal iOS example repo while I was at it: https://github.com/pepinu/KittenTTS-iOS if you just want to see how to get a neural TTS model running on iPhone without the full app machinery around it.
Before I got the iPhone optimization work far enough along, Kokoro only ran in real time on a MacBook, so I was literally putting a laptop on the passenger seat for long drives just to have something read to me. Very inconvenient, but it made me commit to getting the phone path right. The current build isn't really tested on Mac; maybe in the future.
On the LLM tooling question up front: YES, I used Claude Code and Codex throughout. I might be too far into tokenmaxxing, though, since I'd run several sessions in tandem for bug hunting and several more for review to get a wisdom of the crowd of sorts.