Hi, it's actually not using transformers; those would be too slow. It uses a combination of CNNs and linear layers. And correct, it works on embeddings, not waveforms or spectrograms. The inputs are MIDI files, some of which I made myself in FL Studio. The model creates a "latent representation" of each MIDI file, and I can then sample randomly from this latent space to get an original piece. In my opinion, the most important part is the preprocessing.
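To make the sampling step concrete, here is a toy sketch in plain Python. It is not the actual model: the random, untrained linear decoder is only a stand-in for the CNN and linear layers, the standard normal latent space is an assumption, and all names and sizes (`LATENT_DIM`, `NUM_STEPS`, `make_toy_decoder`, `sample_latent`, `decode`) are made up for illustration.

```python
import random

LATENT_DIM = 8       # assumed size of the latent vector
NUM_STEPS = 16       # assumed number of time steps in a generated piece
NUM_PITCHES = 128    # MIDI pitch numbers run from 0 to 127


def make_toy_decoder(rng):
    """Build an untrained random linear layer (a stand-in for the real decoder)."""
    return [
        [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
        for _ in range(NUM_STEPS * NUM_PITCHES)
    ]


def sample_latent(rng):
    """Draw a random point from a standard normal latent space (an assumption)."""
    return [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]


def decode(weights, z):
    """Map a latent vector to one MIDI pitch per time step (highest score wins)."""
    scores = [sum(w * x for w, x in zip(row, z)) for row in weights]
    melody = []
    for step in range(NUM_STEPS):
        step_scores = scores[step * NUM_PITCHES:(step + 1) * NUM_PITCHES]
        melody.append(max(range(NUM_PITCHES), key=step_scores.__getitem__))
    return melody


rng = random.Random(0)
decoder = make_toy_decoder(rng)
melody = decode(decoder, sample_latent(rng))
print(melody)  # 16 pitch numbers; just noise here, since the decoder is untrained
```

With a trained decoder in place of the random weights, each new latent sample would decode to a different piece; the rest of the flow would stay the same.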
That's fascinating. This sounds like a variational autoencoder. The embeddings, which from my humble point of view (as a trained musician) are a largely unexplored area not really supported by existing theory, are at the same time decisive for the result. Have you found a good solution for this?