Gemma can take in audio, images, and text, but only talks back in text. Mimi can turn codec tokens back into speech. So I froze both sides and trained a small graft in the middle: Gemma hidden states -> Mimi audio tokens.
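Here's a rough sketch of what a graft like that could look like in PyTorch. It is not the code from this project: the `Graft` class, the `freeze` helper, and every size (hidden dimension, 8 codebooks, 2048 entries per codebook) are assumptions picked for illustration, and a tiny `nn.Linear` stands in for the frozen Gemma. It also assumes the hidden states are already lined up one per audio frame, which a real setup would have to handle.

```python
import torch
import torch.nn as nn

# All sizes are illustrative assumptions, not values taken from the post.
HIDDEN_DIM = 256      # stand-in for the language model's hidden size
NUM_CODEBOOKS = 8     # assumed number of codec codebooks per audio frame
CODEBOOK_SIZE = 2048  # assumed number of entries in each codebook


class Graft(nn.Module):
    """Hypothetical adapter: LM hidden states -> per-codebook token logits."""

    def __init__(self, hidden_dim, num_codebooks, codebook_size, width=512):
        super().__init__()
        self.num_codebooks = num_codebooks
        self.codebook_size = codebook_size
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, width),
            nn.GELU(),
            nn.Linear(width, num_codebooks * codebook_size),
        )

    def forward(self, hidden):
        # hidden: (batch, frames, hidden_dim)
        logits = self.net(hidden)
        # -> (batch, frames, num_codebooks, codebook_size)
        return logits.view(*hidden.shape[:2], self.num_codebooks, self.codebook_size)


def freeze(module):
    """Turn off gradients so the optimizer never touches this module."""
    for p in module.parameters():
        p.requires_grad_(False)
    return module.eval()


torch.manual_seed(0)

# Stand-in for the frozen language model; a real run would read Gemma's hidden states.
frozen_lm = freeze(nn.Linear(HIDDEN_DIM, HIDDEN_DIM))
graft = Graft(HIDDEN_DIM, NUM_CODEBOOKS, CODEBOOK_SIZE)
optimizer = torch.optim.AdamW(graft.parameters(), lr=1e-3)  # only the graft trains

# Fake batch: 2 clips, 10 frames each, with random target codec tokens.
inputs = torch.randn(2, 10, HIDDEN_DIM)
target_codes = torch.randint(0, CODEBOOK_SIZE, (2, 10, NUM_CODEBOOKS))

with torch.no_grad():
    hidden = frozen_lm(inputs)

# One training step: cross-entropy over every (frame, codebook) slot.
logits = graft(hidden)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, CODEBOOK_SIZE), target_codes.reshape(-1)
)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# At inference, these token ids would be handed to the frozen codec decoder.
predicted_codes = logits.argmax(dim=-1)  # (batch, frames, num_codebooks)
```

Predicting all codebooks independently from one hidden state is the simplest possible head. It ignores dependencies between codebooks, so something autoregressive over codebooks may well be needed for clean audio.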
I've enjoyed playing with this because the bad audio outputs have sounded hilarious.