Sopro: A 169M parameter real-time TTS model with zero-shot voice cloning (github.com)

6 points by marques576 a month ago | 1 comment

marques576 a month ago
Some features:
169M parameters
Streaming support
Zero-shot voice cloning
0.25 RTF on CPU, meaning it generates 30 seconds of audio in 7.5 seconds
Requires 3-12 seconds of reference audio for voice cloning
Apache 2.0 license
The model was trained on a single L40S GPU. It’s not SOTA in most cases, can be a bit unstable, and sometimes fails to capture voice likeness.
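
For anyone unfamiliar with RTF (real-time factor): it is the time spent generating divided by the duration of the audio produced, so values below 1 mean faster than real time. Here is a minimal sketch of the arithmetic behind the 0.25 figure above. The function names are made up for illustration and are not part of Sopro's API.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock time spent generating / duration of audio generated."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time to generate `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf


# Numbers from the post: 30 seconds of audio generated in 7.5 seconds on CPU.
rtf = real_time_factor(7.5, 30.0)
print(rtf)                        # 0.25
print(synthesis_time(30.0, rtf))  # 7.5
print(1 / rtf)                    # 4.0, i.e. 4x faster than real time
```

Real numbers will vary with the CPU, the text length, and whether streaming is used, so treat 0.25 as a rough figure and not a guarantee.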