Overall, a fun experience. I think that MH was better than Scott. Max was missing the glitches and moving background but I'd imagine both of those are technically challenging to achieve.
Michael Scott's mouth seemed a bit wrong - I was thinking Michael J Fox, but my wife then corrected me with Jason Bateman, which is much more like it. He knew Office references alright, but wasn't quite Steve Carell enough.
The default idle state while it was listening could do with some work, I think - that was the least convincing bit; for Max I'd expect he would have just glitched or even stayed completely still. Michael Scott seemed too synthetic at that point.
Don't get me wrong, this was pretty clever and I enjoyed it, just trying to say what I found lacking without trying to sound like I could do better (which I couldn't!).
I am sure they will have open source solutions for fully-generated real-time video within the next year. We also plan to provide an API for our solution at some point.
How does voice work? You mentioned Deepgram. Does it mean you do Speech-to-Text-to-Speech?
The key technical unlock was getting the model to generate a video faster than real-time. This allows us to stream video directly to the user. We do this by recursively generating the video, always using the last few frames of the previous output to condition the next output. We have some tricks to make sure the video stays relatively stable even with recursion.
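To make that concrete, here is a minimal sketch of what that recursion could look like - the model.generate_chunk call, the chunk size, and the number of context frames are all hypothetical placeholders for illustration, not our actual API:

    # Sketch of recursive video generation: each new chunk is conditioned
    # on the last few frames of the previous chunk.
    from collections import deque

    CONTEXT_FRAMES = 4   # trailing frames to condition on (assumed value)
    CHUNK_SIZE = 16      # frames generated per step (assumed value)

    def stream_video(model, seed_frames, send_to_client, num_chunks):
        # Keep only the trailing context frames between steps.
        context = deque(seed_frames, maxlen=CONTEXT_FRAMES)
        for _ in range(num_chunks):
            # Generate the next chunk, conditioned on the previous output.
            new_frames = model.generate_chunk(context=list(context),
                                              num_frames=CHUNK_SIZE)
            for frame in new_frames:
                send_to_client(frame)  # stream frames out as they are produced
                context.append(frame)  # newest frames become the next context

The stabilization tricks mentioned above would live inside the generation step itself; they are not shown here.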
I do not think they are running in real time though. From the website: "Personal RTX 4090 generates at speed 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (teacache)." At 25 fps, that means it would take them 25 × 1.5 s = 37.5 s to generate 1 second of video, which is fast for video generation but way slower than real time.
I mostly meant that the insight of using the previous frames to generate new frames is what reminded me of it - I lack knowledge of the specifics of the work.
Glad if it's useful for your work/research to check out the paper.
Edit: the real-time-ness of it also depends on what hardware you are running the model on - obviously it's easier to achieve on an H100 than on a 3090 - but these memory optimizations really help make these models usable at all for local stuff, which I think is a great win for overall adoption and for further stuff being built on top of them. A bit like how sd-webui from automatic1111, together with the open-sourced Stable Diffusion weight models, set off a boom in image gen a couple of years back.
And yes, exactly. Between each character interaction we run speech-to-text, an LLM, text-to-speech, and then our video model. All of it happens in a continuously streaming pipeline.
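As a rough illustration only - the stt/llm/tts/video objects and their methods below are hypothetical placeholders, not our actual stack - the per-turn loop looks something like this:

    # Sketch of one conversational turn: speech-to-text -> LLM ->
    # text-to-speech -> video model, with each stage streaming into the next.
    async def handle_turn(mic_audio, stt, llm, tts, video, send_to_client):
        # 1. Transcribe the user's speech.
        transcript = await stt.transcribe(mic_audio)

        # 2. Stream the LLM reply in chunks...
        async for text_chunk in llm.stream_reply(transcript):
            # 3. ...synthesize audio for each chunk as soon as it is ready...
            audio_chunk = await tts.synthesize(text_chunk)
            # 4. ...and drive the video model with that audio, so frames go
            #    out while the rest of the reply is still being generated.
            async for frame in video.animate(audio_chunk):
                await send_to_client(frame, audio_chunk)

    # In a real server this would run inside an asyncio event loop / websocket
    # handler; error handling and buffering are omitted for brevity.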
And that's probably still a subsidized cost!
Bfw "what is the best weed whacker" is John C. Dvorak's "AI test"
IP law tends to be "richer party wins". There's going to be a bunch of huge fights over this, as both individual artists and content megacorps are furious about this copyright infringement, but OpenAI and friends will get the "we're a hundred-billion-dollar company, we can buy our own legislation" treatment.
e.g. https://www.theguardian.com/technology/2024/dec/17/uk-propos... a straightforward nationalisation of all UK IP so that it can be instantly given away for free to US megacorps.
How can you be sure? Investing in an ASIC seems like one of the most expensive and complicated solutions.
Impressive technology - impressive demo! Sadly, the conversation seems a little bit overplayed. Might be worth plugging ChatGPT or some better LLM into the logic section.
Also, you do realize that this will be used to defraud people of money and/or property, right?
All about coulda, not shoulda.
Sadly, no one cares. LLM-driven fraud is already happening, and it is about to become more profitable.
It seems clumsy to use copyrighted characters in your demos.
Seems to me this will be a standard way to interact with LLMs and even companies - like a receptionist/customer service/salesperson.
Obviously games could use this.
Super cool product in any case.
I'll pass thanks.
Each person gets a dedicated GPU, so we were worried about costs before. But let's just go for it.
Really looking forward to trying this out!