I was using https://huggingface.co/onnx-community/pyannote-segmentation-... because with ONNX, I could run it on Intel servers with vectorized instructions, locally on my Mac, AND in-browser with transformers.js
VAD is absurdly time-effective (I think on the order of 10 seconds to segment an hour of audio, or something like that) and it reduces the false-positive rate and cost of transcription and multimodal inference, since you can pass just the small segmented clips to a model that specializes in that, then encode the result as text before it ever reaches the expensive model.
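(The real project uses the pyannote segmentation model via ONNX; as a toy illustration of why VAD is so cheap, here's a naive frame-energy gate, numpy only. The `naive_vad` helper is hypothetical, not the actual model, but it shows the shape of the output: (start, end) speech segments you can feed downstream.)

```python
import numpy as np

def naive_vad(audio: np.ndarray, sr: int = 16000,
              frame_ms: int = 30, threshold: float = 0.01):
    """Toy energy-based VAD: return (start_sec, end_sec) speech segments.

    NOT pyannote -- just a per-frame RMS gate, to illustrate why VAD is
    cheap: one pass of simple arithmetic over the samples.
    """
    frame_len = sr * frame_ms // 1000
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    voiced = rms > threshold

    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments

# Synthetic clip: 1s silence, 1s of a 440 Hz tone, 1s silence.
sr = 16000
t = np.arange(sr) / sr
audio = np.concatenate([np.zeros(sr),
                        0.5 * np.sin(2 * np.pi * 440 * t),
                        np.zeros(sr)])
segments = naive_vad(audio, sr)
print(segments)  # → [(0.99, 2.01)]
```

Only the ~1s voiced segment goes on to the transcription model; the other two thirds of the clip never cost you a token.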
Also, I had a huge head start, as I spent a month or two working on this in September 2025, shelved it and dusted it back off this weekend.
I think in https://github.com/mattmireles/gemma-tuner-multimodal/blob/m... and https://github.com/mattmireles/gemma-tuner-multimodal/blob/m... and https://github.com/mattmireles/gemma-tuner-multimodal/blob/m... you have a superset of the various kludges I have in my fine-tuning repo. I'm going to study this and do what I can to learn from it. Really appreciate you sharing it here!
Definitely interested in swapping notes if you are, though. Probably the biggest thing that came out of this exercise for us was realizing that Apple actually ships some really powerful local inference and data-processing tools; they're just marketed much more towards application developers, so a lot of them fly under the radar.
We just published https://github.com/accretional/macos-vision to make Apple's local OCR, image segmentation, foreground masking, facial analysis, classification, and video tracking accessible via CLI, and hopefully more common in ML and data workloads. Hopefully you or someone else can get some use out of it. I definitely will from yours!
Here’s the trick: use Gemini Pro deep research to create “Advanced Hacker’s Field Guide for X” where X is the problem that you are trying to solve. Ask for all the known issues, common bugs, unintuitive patterns, etc. Get very detailed if you want.
Then feed that to Claude / Codex / Cursor. Basically, create a cheat sheet for your AI agents.
This will unlock a whole new level of capability.
I’m @mattmireles on Twitter — feel free to DM me.
Do you really need that much data for fine-tuning?
The actual plan was to distill Gemini 2.5 Pro into the best on-device voice dictation model.
Pretty sure it would have worked. Alas.
What is the practical latency difference you see between on-device and, say, Whisper streaming over the internet? Comparable? It seems internet latency would be mostly negligible (assuming reasonable internet/cell coverage), or at least compensated for by the higher-end hardware on the other side?
If you run a smaller distil-whisper variant AND you optimize the decoder to run on the Apple Neural Engine, you can get latency down to ~300ms without any backend infra.
The issue is that the smaller models tend to suck, which is why the fine-tuning is valuable.
My hypothesis is that you can distill a giant model like Gemini into a tiny whisper-class model.
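(Since Gemini doesn't share a vocabulary or expose logits you can match against a whisper decoder, in practice this would likely be sequence-level distillation: train the small model on the big model's transcripts. But for intuition, here's a minimal numpy sketch of the classic Hinton-style soft-target KL term you'd use when teacher and student do share a vocabulary; `distill_loss` is a hypothetical helper, not anything from my repo.)

```python
import numpy as np

def softmax(x, T=1.0):
    # Temperature-scaled softmax, numerically stabilized.
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 (the usual gradient-magnitude correction).

    In a real setup this term is added to the normal CE loss on the
    teacher-generated transcripts; here it's just the KL on one
    pair of logit vectors.
    """
    p = softmax(teacher_logits, T)  # teacher's softened distribution
    q = softmax(student_logits, T)  # student's softened distribution
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum() * T * T)

teacher = np.array([2.0, 0.5, -1.0])
student_good = np.array([2.0, 0.5, -1.0])  # matches teacher
student_bad = np.array([-1.0, 0.5, 2.0])   # disagrees with teacher

print(distill_loss(student_good, teacher))  # → 0.0
print(distill_loss(student_bad, teacher))   # larger penalty
```

The T² scaling keeps the distillation gradient comparable in magnitude to the hard-label CE gradient as you raise the temperature.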
But it depends on the machine you're running, which is why local AI is a PITA.