I am sure they will have open source solutions for fully-generated real-time video within the next year. We also plan to provide an API for our solution at some point.
How does voice work? You mentioned Deepgram. Does it mean you do Speech-to-Text-to-Speech?
The key technical unlock was getting the model to generate a video faster than real-time. This allows us to stream video directly to the user. We do this by recursively generating the video, always using the last few frames of the previous output to condition the next output. We have some tricks to make sure the video stays relatively stable even with recursion.
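The recursive conditioning described above can be sketched roughly as follows. This is a minimal illustration of the control flow only: `generate_chunk`, the context window size, and the chunk size are all assumed placeholders, not the actual model or parameters.

```python
from collections import deque

CONTEXT_FRAMES = 4   # how many trailing frames condition the next chunk (assumed)
CHUNK_FRAMES = 16    # frames produced per model call (assumed)

def generate_chunk(context, prompt):
    # Placeholder for the real video-model call; returns CHUNK_FRAMES new frames.
    # Dummy frame ids stand in for tensors so the loop structure is visible.
    last = context[-1] if context else -1
    return [last + 1 + i for i in range(CHUNK_FRAMES)]

def stream_video(prompt, total_frames):
    """Recursively generate video: each chunk is conditioned on the last
    few frames of the previous chunk, so frames can be streamed to the
    user while generation continues."""
    context = deque(maxlen=CONTEXT_FRAMES)
    produced = 0
    while produced < total_frames:
        chunk = generate_chunk(list(context), prompt)
        for frame in chunk:
            yield frame          # stream each frame out immediately
            context.append(frame)
        produced += len(chunk)

frames = list(stream_video("a talking character", total_frames=32))
```

Because only the last `CONTEXT_FRAMES` frames are kept, memory stays constant no matter how long the stream runs; the stability tricks mentioned above would live inside the real `generate_chunk`.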
I do not think they are running in real time though. From the website: "Personal RTX 4090 generates at speed 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (teacache)." Assuming 15 fps output, that means it would take them 37.5s to generate 1 second of video, which is fast for video but way slower than real time.
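The arithmetic behind the 37.5s figure, spelled out (the 15 fps frame rate is an assumption implied by the numbers, not stated on the site):

```python
SECONDS_PER_FRAME = 2.5   # RTX 4090, unoptimized, from the quoted website
FPS = 15                  # assumed output frame rate implied by the 37.5s figure

# Wall-clock seconds needed to produce one second of video.
wall_clock_per_video_second = SECONDS_PER_FRAME * FPS
print(wall_clock_per_video_second)  # 37.5
```

Real time would require this product to be at or below 1.0, i.e. per-frame latency under 1/FPS seconds.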
I mostly meant that the insight of using the previous frames to generate new frames is what reminded me of it, but I lack knowledge of the specifics of the work.
Glad if it's useful for your work/research to check out the paper.
edit: the real-time-ness of it also has to take into account what HW you are running your model on; it is obviously easier to pull off on an H100 than on a 3090. But these memory optimizations really help make these models usable at all for local stuff, which I think is a great win for overall adoption and for further things being built upon them, a bit like how sd-webui from automatic1111, alongside the open-sourced Stable Diffusion weights, was a boom for image gen a couple of years back.
And yes, exactly. In between each character interaction we need to do speech-to-text, LLM, text-to-speech, and then our video model. All of it happens in a continuously streaming pipeline.
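One way to picture that continuously streaming pipeline is as chained generators, where each stage starts consuming as soon as the previous one emits instead of waiting for a full turn to finish. The stage functions below are hypothetical stand-ins (the actual services, e.g. Deepgram for STT, are not shown):

```python
# Each stage is a generator transforming a stream; strings stand in for
# real audio buffers, tokens, and frames.

def speech_to_text(audio_chunks):
    for chunk in audio_chunks:          # e.g. Deepgram streaming STT (assumed)
        yield f"text({chunk})"

def llm(text_stream):
    for text in text_stream:            # LLM produces the character's reply
        yield f"reply({text})"

def text_to_speech(token_stream):
    for tokens in token_stream:         # TTS turns reply tokens into audio
        yield f"audio({tokens})"

def video_model(speech_stream):
    for speech in speech_stream:        # video model lip-syncs to the audio
        yield f"frame({speech})"

def pipeline(audio_chunks):
    """Chain STT -> LLM -> TTS -> video so data flows through all four
    stages concurrently, chunk by chunk."""
    return video_model(text_to_speech(llm(speech_to_text(audio_chunks))))

for frame in pipeline(["hello"]):
    print(frame)  # frame(audio(reply(text(hello))))
```

In a production system each stage would run as an async task over network streams, but the laziness of generators captures the same property: end-to-end latency is per-chunk, not per-utterance.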
Impressive technology - impressive demo! Sadly, the conversation seems to be a little bit overplayed. Might be worth plugging ChatGPT or some better LLM into the logic section.
Super cool product in any case.
It seems clumsy to use copyrighted characters in your demos.
Seems to me this will be a standard way to interact with LLMs and even companies - like a receptionist/customer service/salesperson.
Obviously games could use this.
How can you be sure? Investing in an ASIC seems like one of the most expensive and complicated solutions.
I'll pass thanks.
Each person gets a dedicated GPU, so we were worried about costs before. But let's just go for it.
Really looking forward to trying this out!