We've spent 2.5 years building a generative music model and a brand-new interface for making music. We are a tiny team of self-taught researchers and seasoned engineers and designers. We all love and make music, and felt unsatisfied with how existing music models were deployed and used.
We started with a long, grueling journey of building and refining the whole stack to pretrain and post-train our music model on a shoestring budget. Two key lessons along the way. First, it is a trap to treat compression and reconstruction quality in the VAE / audio codec as the most important things; what really matters is the downstream learnability of the latents. Second, hyperparameter-optimal scaling laws are the best way to ablate training-recipe and architecture experiments.
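To make the scaling-law point concrete, here is a minimal sketch of how such an ablation could be set up. It is not our actual pipeline: the function names, compute values, and loss numbers are all hypothetical, and the data is synthetic. Each loss is assumed to be the best one found after a hyperparameter sweep at that compute scale.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of loss ~= a * compute**(-b) in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, b)

def predict(a, b, compute):
    """Loss predicted by the fitted power law at a given compute budget."""
    return a * compute ** (-b)

# Synthetic placeholder data, NOT measurements from our runs.
# Each value stands for the best loss over a hyperparameter sweep at that scale.
compute = [1e14, 1e15, 1e16, 1e17]  # hypothetical training FLOPs
recipe_a = [predict(8.0, 0.05, c) for c in compute]
recipe_b = [predict(12.0, 0.06, c) for c in compute]

fit_a = fit_power_law(compute, recipe_a)
fit_b = fit_power_law(compute, recipe_b)

# Compare the two recipes at a budget larger than any ablation run.
target = 1e20
better = "B" if predict(*fit_b, target) < predict(*fit_a, target) else "A"
print(f"recipe A: a={fit_a[0]:.2f}, b={fit_a[1]:.3f}")
print(f"recipe B: a={fit_b[0]:.2f}, b={fit_b[1]:.3f}")
print(f"better recipe at {target:.0e} FLOPs: {better}")
```

In this toy setup recipe A has the lower loss at every scale that was actually run, yet recipe B has the steeper slope and wins when extrapolated to the target budget. A single-scale comparison would have picked the wrong recipe, which is why comparing fitted curves is attractive on a small budget.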
We finally reached a level of quality with our models that we believe punches well above its compute budget (~$75k for our flagship pretrain). The models are Aphrodite, our audio codec (~3 kbps); Apollo, our diffusion transformer; and Virgil, our synthetic audio captioner.
Our next step was to rethink the HCI layer for music models. We rebuilt the DAW / timeline experience from first principles into a beginner-friendly web GAW (Generative Audio Workstation) where inpainting, extending, remixing, and multitrack / stem editing are as intuitive as "painting" on the screen. We aimed for an experience that stays in touch with the spirit of creation, where the creator still walks away from a finished song feeling pride in what they made, and yet is accessible enough for anyone to feel the joy of making music.
Come check it out!
Manifesto + launch video on X: https://x.com/audiogen/status/2059297892465062386?s=20
Audio samples + product demo: https://audiogen.co/demos
Waitlist: https://audiogen.co/waitlist
Happy to answer questions on the architecture, infra, or anything else.
Wouldn’t it make more sense for AI music to do something that human beings genuinely cannot do, rather than copy us? There’s nothing as soulless as an AI track that you call soul…