I'm trying to wrap my head around the inference latency trade-offs here. If you're replacing the standard softmax output layer with a diffusion-based generative head, aren't you effectively swapping a single cheap matrix multiply for an iterative denoising process at every generation step? It seems like the compute cost per second of generated audio/video would be massive compared to standard discrete tokenization.
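
For what it's worth, here's the back-of-envelope comparison I'm making. Every size and the step count K below are my own guesses, not numbers from the work being discussed:

```python
# Rough per-step cost sketch; all sizes and K are assumptions, not from the paper.

d_model = 2048       # assumed backbone hidden size
vocab = 65_536       # assumed codebook size for a discrete-token softmax head
head_width = 1024    # assumed hidden width of the diffusion head
head_layers = 6      # assumed depth of the diffusion head
K = 100              # assumed number of denoising iterations per output step

# Softmax head: one d_model x vocab projection per step (~2*d*V FLOPs).
softmax_flops = 2 * d_model * vocab

# Diffusion head: the same small network is run K times per output step.
per_pass_flops = head_layers * 2 * head_width * head_width
diffusion_flops = K * per_pass_flops

print(f"softmax head   : {softmax_flops / 1e6:.0f} MFLOPs per step")
print(f"diffusion head : {diffusion_flops / 1e6:.0f} MFLOPs per step ({K} passes)")
print(f"ratio          : {diffusion_flops / softmax_flops:.1f}x")
```

Obviously the ratio swings entirely on K and on how small the head is relative to the vocabulary projection (few-step samplers or distillation would shrink the gap a lot), which is exactly the trade-off I can't pin down.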