I ran an analysis on hundreds generated videos featuring various races/ethnicities and found that Chinese models are overfitted on East Asian faces (predictable though) and have trouble properly animating many European/most African faces (bad lipsync).
They all had accumulating artifacts over the long term (the video stops being stable after N seconds, for example the image gets more and more washed out)
So I don't have high hopes here, everyone on the demo page is predictably East Asian and the output quality doesn't look better than prior art. I guess the innovation here is that it's end-to-end but we need to see if it's any good. WAN-derived image-audio-to-video systems used to be notoriously slow, here they boast 25 FPS for 192p but it's pretty slow actually, I managed to reach similar FPS for 720p with prior art.