So, I tried the following approaches. For the second attempt, I set strict requirements, asking the model not to reinterpret the music I provided but to use it directly as the lip-syncing reference. However, the results were still unsatisfactory.
For the third attempt, I already had decent dance movements but lacked proper lip-syncing. I uploaded the video along with the corresponding song clip, hoping the model would use both as references to generate accurate lip-syncing. Unfortunately, the outcome was still not ideal.
As a result, I ended up reverting to my initial video production method: generating the dance based solely on text and images, then editing and refining it before using an open-source model for lip-syncing adjustments.
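Since the post doesn't name the open-source lip-sync model, here is a minimal sketch of what that last step might look like, assuming a tool like Wav2Lip driven from its command-line inference script. The paths, checkpoint file, and output location are placeholders, and the flag names follow the Wav2Lip README as I remember it, so check them against the actual repo before running.

```python
# Hypothetical sketch: run an open-source lip-sync pass (Wav2Lip assumed here
# purely as an example) over a dance clip that was generated from text and
# images, using the matching song segment as the audio reference.
# All paths and the checkpoint name are placeholders.
import subprocess
from pathlib import Path

WAV2LIP_DIR = Path("Wav2Lip")                              # local clone of the repo (assumption)
CHECKPOINT = WAV2LIP_DIR / "checkpoints/wav2lip_gan.pth"   # pretrained weights (placeholder)
DANCE_CLIP = Path("output/dance_edited.mp4")               # edited dance video (placeholder)
SONG_CLIP = Path("assets/song_clip.wav")                   # corresponding song segment (placeholder)
RESULT = Path("output/dance_lipsynced.mp4")

def run_lipsync(face_video: Path, audio: Path, out_path: Path) -> None:
    """Invoke the lip-sync model's inference script on one video/audio pair."""
    cmd = [
        "python", str(WAV2LIP_DIR / "inference.py"),
        "--checkpoint_path", str(CHECKPOINT),
        "--face", str(face_video),
        "--audio", str(audio),
        "--outfile", str(out_path),
    ]
    # check=True raises if the script fails, so a bad run doesn't go unnoticed.
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_lipsync(DANCE_CLIP, SONG_CLIP, RESULT)
```

Keeping lip-syncing as a separate post-processing pass like this is what lets the already-edited dance footage stay untouched; only the mouth region is regenerated against the audio.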
However, this approach has its drawbacks. During the secondary generation, the character's appearance can change (for example, shifting from an Asian face to a European one), or unexpected elements can appear, such as the character suddenly holding a microphone or wearing headphones. Some of these issues can be mitigated with specific prompts, but others remain: clothing patterns get altered, or the outfit's style changes inconsistently across different angles. There are quite a few downsides to secondary generation, and I won't list every problem I encountered here.
Of course, it’s also possible that I’m not using the tool correctly. If anyone knows how to solve these issues, feel free to leave a comment and share your insights. Thank you, everyone!