Qwen3-TTS family is now open sourced: Voice design, clone, and generation (qwen.ai)

744 points by Palmik 16 days ago | 32 comments

simonw 16 days ago
If you want to try out the voice cloning yourself you can do that on this Hugging Face demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS - switch to the "Voice Clone" tab, paste in some example text and use the microphone option to record yourself reading that text - then paste in other text and have it generate a version of that read using your voice.
I shared a recording of audio I generated with that here: https://simonwillison.net/2026/Jan/22/qwen3-tts/
- javier123454321 16 days ago
  This is terrifying. With this and z-image-turbo, we've crossed a chasm. And a very deep one. We are currently protected by screens; we can, and should, assume everything behind a screen is fake unless rigorously (and systematically, i.e. cryptographically) proven otherwise. We're sleepwalking into this; not enough people know about it.
  - rdtsc 16 days ago
    That was my thought too. You’d have “loved ones” calling with their faces and voices asking for money in some emergency. But you’d also have plausible deniability as anything digital can be brushed off as “that’s not evidence, it could be AI generated”.
    rpdillon 16 days ago
    Only if you focus on the form instead of the content. For a long time my family has had secret words and phrases we use to identify ourselves to each other over secure, but unauthenticated, channels (i.e. the channel is encrypted, but the source is unknown). The military has had to deal with this for some time, and developed various forms of IFF that allies could use to identify themselves. E.g. for returning aircraft, a sequence of wing movements that identified you as a friend. I think for a small group (in this case, loved ones), this could be one mitigation of that risk. My parents did this with me as a kid, ostensibly as a defense against some other adult saying "My mom sent me to pick you up...". I never did hear of that happening, though.
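    A fixed secret word can be replayed by anyone who overhears it once. One way to sketch a replay-resistant variant of the idea above is a challenge-response check: the verifier sends a fresh random challenge and the other side answers with a short code derived from the shared secret. This is only an illustration, not what the family above actually does; the secret value and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets

# Hypothetical shared secret, agreed in person beforehand (an assumption for this sketch).
FAMILY_SECRET = b"correct horse battery staple"


def make_challenge() -> str:
    """Return a fresh random challenge so an old answer cannot be replayed."""
    return secrets.token_hex(8)


def respond(secret: bytes, challenge: str) -> str:
    """Derive a short response code from the shared secret and the challenge."""
    digest = hmac.new(secret, challenge.encode(), hashlib.sha256).hexdigest()
    return digest[:12]  # truncated so it is short enough to read aloud


def verify(secret: bytes, challenge: str, response: str) -> bool:
    """Check a response using a constant-time comparison."""
    return hmac.compare_digest(respond(secret, challenge), response)


if __name__ == "__main__":
    challenge = make_challenge()
    answer = respond(FAMILY_SECRET, challenge)
    print("challenge:", challenge)
    print("valid answer accepted:", verify(FAMILY_SECRET, challenge, answer))
```

    A cloned voice alone does not help an impostor here, because the answer depends on a secret that never travels over the channel.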
    nineteen999 14 days ago
    That sounds way too complicated. I get around that by just not having any family any more.
    plagiarist 16 days ago
    For now you could ask them to turn away from the camera while keeping their eyes open. If they are a Z-Image they will instantly snap their head to face you.
    muggermuch 15 days ago
    This scenario is oddly terrifying.
    aprilthird2021 16 days ago
    > as anything digital can be brushed off as “that’s not evidence, it could be AI generated”.
    This won't change anything about Western-style courts, which have always required an unbroken chain of custody of evidence for evidence to be admissible in court.
    cwillu 16 days ago
    Courts account for a vanishingly small proportion of most people's lives.
    aprilthird2021 15 days ago
    So does the presentation of evidence...
    neevans 16 days ago
    this was already possible with chatterbox for a long while.
    freedomben 16 days ago
    Yep, this has been the reality now for years. Scammers have already had access to it. I remember an article years ago about a grandma who wired her life savings to a scammer who claimed to have her granddaughter held hostage in a foreign country. Turns out they just cloned her voice from Facebook data and knew her schedule so timed it while she would be unreachable by phone.
    DANmode 16 days ago
    or anyone who refuses to use hearing aids.
  - u8080 16 days ago
    https://www.youtube.com/watch?v=diboERFAjkE pretty much this
    harshreality 16 days ago
    That's a reupload of Cybergem's video. https://www.youtube.com/watch?v=-gGLvg0n-uY
    javier123454321 16 days ago
    Oh wow. Thank you for this. Amazing, terrifying, spot on, all of it.
    arcanemachiner 16 days ago
    I knew what it would be before I even opened it. The crazy thing is that video is like 3 years old.
  - oceanplexian 16 days ago
    > This is terrifying.
    Far more terrifying is Big Tech having access to a closed version of the same models, in the hands of powerful people with a history of unethical behavior (i.e. Zuckerberg's "Dumb Fucks" comments). In fact it's a miracle and a bit ironic that the Chinese would be the ones to release a plethora of capable open source models, instead of the scraps like we've seen from Google, Meta, OpenAI, etc.
    mrandish 16 days ago
    > Far more terrifying is Big Tech having access to a closed version of the same model
    Agreed. The only thing worse than everyone having access to this tech is only governments, mega corps and highly-motivated bad actors having access. They've had it a while and there's no putting the genii back in the bottle. The best thing the rest of us can do is use it widely so everyone can adapt to this being the new normal.
    apitman 15 days ago
    I know genii is the plural of genie, but for a second I thought it was a typo of genai and I kind of like that better.
    javier123454321 16 days ago
    I do strongly agree. Though the societal impact is only mitigated by open models, not curtailed at all.
    refulgentis 15 days ago
    The really terrifying thing is the next logical step from the instinctual reaction. Eschew miracle, eschew the cognitive bias of feeling warm and fuzzy for the guy who gives you it for free.
    Socratic version: how can the Chinese companies afford to make them and give them out for free? Cui bono?
    n.b. it's not because they're making money on the API, ex. open openrouter and see how Moonshot or DeepSeek's 1st party inference speed compares to literally any other provider. Note also that this disadvantage can't just be limited to LLMs, due to GPU export rules.
    vonneumannstan 15 days ago
    >Far more terrifying is Big Tech having access to a closed version of the same models, in the hands of powerful people with a history of unethical behavior (i.e. Zuckerberg's "Dumb Fucks" comments).
    Lol what exactly do you think Zuck would do with your voice, drain your bank account??
    liamN 14 days ago
    More likely sell your family ads while using your voice.
  - razster 16 days ago
    I'd be a bit more worried when Z-Image Edit/Base is released. Flux.2 Klein is out and it's on par with ZIT, and with some fine-tuning can just about hit Flux.2. Adding on top of that is Qwen Image Edit 2511 for additional refinement. Anything is possible. Those folks at r/StableDiffusion are falling over the possible release of Z-Image-Omni-Base, a hold-me-over until the actual base is out. I've heard it's equal to Flux.2. Crazy time.
  - TacticalCoder 15 days ago
    > With this and z-image-turbo, we've crossed a chasm.
    And most of all: they're both local models. The cat is out of the box and it's never going back in. There's no censoring of this. No company that can pull the plug. Anyone with a semi-modern GPU can use these models.
  - fridder 16 days ago
    Admittedly I have not dived into it much, but I wonder if we might finally have a use case for NFTs and web3? We need some sort of way to denote that items are person-generated, not AI. That would certainly be easier than trying to determine if something is AI-generated.
    grumbel 16 days ago
    That's the idea behind C2PA[1]: your camera and the tools put a signature on the media to prove its provenance. That doesn't make manipulation impossible (e.g. you could photograph an AI image on a screen), but it does give you a trail of where a photo came from and thus an easier way to filter it or look up the original.
    [1] https://c2pa.org/
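    To make the "signature on the media" idea concrete, here is a deliberately simplified sketch. It is not the C2PA manifest format: real C2PA uses certificate-based asymmetric signatures, whereas this stand-in uses an HMAC with a hypothetical device key so that it runs on the Python standard library alone. It only shows the general shape: hash the media, sign the hash, and detect any later change to the bytes.

```python
import hashlib
import hmac

# Hypothetical key held by the capturing device (an assumption for this sketch).
DEVICE_KEY = b"hypothetical-camera-key"


def sign_media(media: bytes, key: bytes = DEVICE_KEY) -> dict:
    """Build a tiny provenance record: content hash plus a keyed signature over it."""
    digest = hashlib.sha256(media).hexdigest()
    signature = hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()
    return {"sha256": digest, "signature": signature}


def verify_media(media: bytes, record: dict, key: bytes = DEVICE_KEY) -> bool:
    """Return True only if the bytes still match the record and the signature is valid."""
    digest = hashlib.sha256(media).hexdigest()
    expected = hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()
    return digest == record["sha256"] and hmac.compare_digest(
        expected, record["signature"]
    )


if __name__ == "__main__":
    photo = b"raw sensor bytes"
    record = sign_media(photo)
    print("untouched photo verifies:", verify_media(photo, record))
    print("edited photo verifies:", verify_media(photo + b"!", record))
```

    As noted above, a scheme like this proves where a file came from and that it was not altered afterwards; it says nothing about whether the scene in front of the sensor was itself genuine.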
    simonw 16 days ago
    How would NFTs/web3 help differentiate between something created by a human and something that a human created with AI and then tagged with their signature using those tools?
    _kb 16 days ago
    In a live conversation context you can mention the term NFTs/web3 and if the far end is human they'll wince a little.
    disillusioned 15 days ago
    This made me laugh far too hard for far too long.
  - echelon 16 days ago
    We're going to be okay.
    There are far more good and interesting use cases for this technology. Games will let users clone their voices and create virtual avatars and heroes. People will have access to creative tools that let them make movies and shows with their likeness. People that couldn't sing will make music.
    Nothing was more scary than the invention of the nuclear weapon. And we're all still here.
    Life will go on. And there will be incredible benefits that come out of this.
    javier123454321 16 days ago
    I'm not denigrating the tech, all I'm saying is that we've crossed to new territory and there will be consequences that we don't understand from this. The same way that social media has been particularly detrimental to young people (especially women) in a way we were not ready for. This __smells__ like it could be worse, alongside (or regardless of) the benefits of both.
    I simply think people don't really know that the new world requires a new set of rules of engagement for anything that exists behind a screen (for now).
    supern0va 16 days ago
    We'll be okay eventually, when society adapts to this and becomes fully aware of the capabilities and the use cases for abuse. But, that may take some time. The parent is right to be concerned about the interim, at the very least.
    That said, I am likewise looking forward to the cool things to come out of this.
    cookiengineer 16 days ago
    > We're going to be okay.
    > And there will be incredible benefits that come out of this.
    Your username is echelon.
    I just wanted to point that out.
    michelb 15 days ago
    Yeah. Not using voice, but... https://nymag.com/intelligencer/article/white-house-posts-fa...
    doug713705 16 days ago
    > Nothing was more scary than the invention of the nuclear weapon. And we're all still here.
    Except that building a nuclear weapon was not available to everyone, certainly not to dumb people whose brain have been feeded with social media content.
    lynx97 15 days ago
    I usually don't correct typos and/or grammar, but you asked for it. Calling random people "dumb" while using an incorrect past tense is pretty funny. It is "fed", not "feeded"...
    DANmode 16 days ago
    > People that couldn't sing will make music.
    I was with you, until
    But, yeah. Life will go on.
    echelon 16 days ago
    There are plenty of electronic artists who can't sing. Right now they have to hire someone else to do the singing for them, but I'd wager a lot of them would like to own their music end-to-end. I would.
    I'm a filmmaker. I've done photons-on-glass production for fifteen years. Meisner trained, have performed every role from cast to crew. I'm elated that these tools are going to enable me to do more with a smaller budget. To have more autonomy and creative control.
    DANmode 16 days ago
    What happens to lyricless electronica if suddenly every electronic artist has quality vocal-backing?
    Oh no.
    Maybe we did frig this up.
    fc417fc802 15 days ago
    On the other hand, maybe we'll get models capable of removing the lyrics from things without damaging the rest of the audio. Or better yet, replacing the lyrics with a new instrument. So it might yet work out in our favor.
    DANmode 15 days ago
    This was one of the first things they were doing with neural nets,
    and there are even a couple SaaS options for it now.
    echelon 16 days ago
    More choices for artists is not a bad thing.
    DANmode 15 days ago
    Indeed.
    But it does change who can be an artist in each niche,
    and that’s been interesting to briefly pause and consider here with the community.
    redwall_hp 16 days ago
    We've had Yamaha Vocaloid for over two decades now, and Synthesizer V is probably coming up on a decade too now. They're like any other synth: MIDI (plus phonemes) in, sound out. It's a tool of musical expression, like any other instrument.
    Hatsune Miku (Fujita Saki) is arguably the most prolific singer in the world, if you consider every Vocaloid user and the millions of songs that have come out of it.
    So I don't think there's any uncharted territory...we still have singers, and sampled VST instruments didn't stop instrumentalists from existing; if anything, most of these newcomer generative AI tools are far less flexible or creatively useful than the vast array of synthesis tools musicians already use.
    fc417fc802 15 days ago
    Miku is neat but not a replacement for a human by any stretch of the imagination. In practice most amateur usage of that lands somewhere in a cringey uncanny valley.
    No one was going to replace voice actors for TV and movie dubs with Miku whereas the cutting edge TTS tools seem to be nearing that point. Presumably human vocal performances will follow that in short order.
    javier123454321 16 days ago
    Yes, the flipside of this is that we're eroding the last bit of ability for people to make a living through their art. We are capturing the market for people to live off of making illustrations, to making background music, jingles, promotional videos, photographs, graphic design, and funnelling those earnings to NVIDIA. The question I keep asking is whether we care to value as a society for people to make a living through their art. I think there is a reason to care.
    It's not so much of an issue with art for art's sake aided by AI. It's an issue with artistic work becoming unviable work.
    volkercraig 16 days ago
    This feels like one of those tropes that keeps showing up whenever new tech comes out. At the advent of recorded music, I'm sure buskers and performers were complaining that live music is dead forever. Stage actors were probably complaining that film killed plays. Heck, I bet someone even complained that video itself killed the radio star. Yet here we are, hundreds of years later, live music is still desirable, plays still happen, and faceless voices are still around, they're just called v-tubers and podcasters.
    fc417fc802 15 days ago
    > This feels like one of those tropes that keeps showing up whenever new tech comes out.
    And this itself is another tired trope. Just because you can pattern match and observe that things repeatedly went a certain way in the past, doesn't mean that all future applications of said pattern will play out the same way. On occasion entire industries have been obliterated without a trace by technological advancement.
    We can also see that there must be some upper ceiling on what humans in general are capable of - hit that and no new jobs will be created because humans simply won't be capable of the new tasks. (Unless we fuse with the machines or genetically engineer our brains or etc but I'm choosing to treat those eventualities as out of scope.)
    Urahandystar 15 days ago
    Give me one aspect in which that has actually happened? I'm wracking my brains but can't think of one. We are a weird species in that even if we could replace ourselves our fascination with ourselves means that we don't ever do it. Cars and bicycles have replaced our ability to travel at great and small distances and yet we still have track events culminating in the Olympics.
    fc417fc802 15 days ago
    Sure, things continue to persist as a hobby, a curiosity, a bespoke luxury, or the like. But that's not at all the same thing as an industry. Only the latter is relevant if we're talking about the economy and employment prospects and making a living and such.
    It's a bit tricky to come up with concrete examples on the spot, in particular because drawing a line around a given industry or type of work is largely subjective. I could point to blacksmithing and someone could object that we still have metalworkers. But we don't have individual craftsmen hammering out pieces anymore. Someone might still object that an individual babysitting a CNC machine is analogous but somehow it feels materially different to me.
    Leather workers are another likely example. To my mind that's materially different from a seamstress, a job that itself has had large parts of the tasks automated.
    Horses might be a good example. Buggies and carriages replaced by the engine. Most of the transportation counterparts still exist but I don't think mechanics are really a valid counterpart to horse tenders and all the (historic) economic activity associated with that. Sure a few rich people keep race horses but that's the sort of luxury I was referring to above. The number of related job positions is a tiny fraction of what it was historically and exists almost solely for the purpose of entertaining rich people.
    Historically the skill floor only crept up at a fairly slow rate so the vast majority of those displaced found new sectors to work in. But the rate of increase appears to have picked up to an almost unbelievable clip (we're literally in the midst of redefining the roles of software developers of all things, one of the highest skilled "bulk" jobs out there). It should be obvious that if things keep up the way they've been going then we're going to hit a ceiling for humans as a species not so long from now.
    redwall_hp 16 days ago
    Tin Pan Alley is the historical industry from before recording: composers sold sheet music and piano rolls to publishers, who sold them to working musicians. The ASCAP/BMI mafia would shake down venues and make sure they were paying licensing fees.
    Recorded music and radio obviously reduced the demand for performers, which reduced demand for sheets.
    javier123454321 16 days ago
    Umm, I don't know if you've seen the current state of trying to make a living with music, but it's widely accepted as dire. Touring is a loss leader, putting out music for free doesn't pay, stream-count payouts are abysmally low. No one buys songs.
    All that is before the fact that streaming services are stuffing playlists with AI generated music to further reduce the payouts to artists.
    > Yet here we are, hundreds of years later, live music is still desirable, plays still happen, and faceless voices are still around...
    Yes all those things still happen, but it's increasingly untenable to make a living through it.
    cthalupa 16 days ago
    Artists were saying this even before streaming, though, much less AI.
    I listen pretty exclusively to metal, and a huge chunk of that is bands that are very small. I go to shows where the headliners stick around at the bar and chat with people. Not saying this to be a hipster - I listen to plenty of "mainstream" stuff too - but to show that it's hard to get smaller than this when it comes to people wanting to make a living making music.
    None of them made any money off of Spotify or whatever before AI. They probably don't notice a difference, because they never paid attention to the "revenue" there either.
    But they do pay attention to Bandcamp. Because Bandcamp has given them more ability to make money off the actual sale of music than they've had in their history - they don't need to rely on a record deal with a big label. They don't need to hope that the small label can somehow get their name out there.
    For some genres, some bands, it's more viable than ever before to make a living. For others, yeah, it's getting harder and harder.
    volkercraig 15 days ago
    Is it though? Think about being a musician 200 years ago. In 1826 you needed to essentially be nobility or nobility-adjacent just to be able to touch an instrument let alone make a living from it. 100 years later, 1926 the barrier to entry was still sky high, nobody could make and distribute recordings without extensive investment. Nowadays it's not uncommon for a 17 year old to download some free composer software, sign up for a few accounts and distribute their music to an audience of millions. It's not easy to do, sure, but there is still opportunity that never existed. If you were to take at random a 20 year old from the general population in 1826, 1923, 1943, 1953, 1973, 83, etc, would you REALLY say that any of them have a BETTER opportunity than today?
    patrickdavey 16 days ago
    But this is different? Wholesale copying of copyrighted works and packaging it up and allowing it to be generated. It's not remotely reasonable
    lynx97 15 days ago
    The number of artists that managed to actually earn enough to pay the rent and bills was already very, very small before AI emerged. I totally agree with you, it's heartbreaking to watch how it got even worse, but the music industry already shuffled the big money to the big players way before AI.
- magicalhippo 16 days ago
  The HF demo space was overloaded, but I got the demo working locally easily enough. The voice cloning of the 1.7B model captures the tone of the speaker very well, but I found it failed at reproducing the variation in intonation, so it sounds like a monotonous reading of a boring text.
  I presume this is due to using the base model, and not the one tuned for more expressiveness.
  edit: Or more likely, the demo not exposing the expressiveness controls.
  The 1.7B model was much better at ignoring slight background noise in the reference audio compared to the 0.6B model though. The 0.6B would inject some of that into the generated audio, whereas the 1.7B model would not.
  Also, without FlashAttention it was dog slow on my 5090, running at 0.3X realtime with just 30% GPU usage. Though I guess that's to be expected. No significant difference in generation speed between the two models.
  Overall though, I'm quite impressed. I haven't checked out all the recent TTS models, but a fair number, and this one is certainly one of the better ones in terms of voice cloning quality I've heard.
  - thedangler 16 days ago
    How did you do this locally? Tools? Language?
    magicalhippo 16 days ago
    I just followed the Quickstart[1] in the GitHub repo, refreshingly straightforward. Using the pip package worked fine, as did installing the editable version using the git repository. Just install the CUDA version of PyTorch[2] first.
    The HF demo is very similar to the GitHub demo, so easy to try out.
    pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
    pip install qwen3-tts
    qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --no-flash-attn --ip 127.0.0.1 --port 8000
    That's for CUDA 12.8, change PyTorch install accordingly.
    Skipped FlashAttention since I'm on Windows and I haven't gotten FlashAttention 2 to work there yet (I found some precompiled FA3 files[3] but Qwen3-TTS isn't FA3 compatible yet).
    [1]: https://github.com/QwenLM/Qwen3-TTS?tab=readme-ov-file#quick...
    [2]: https://pytorch.org/get-started/locally/
    [3]: https://windreamer.github.io/flash-attention3-wheels/
    dur-randir 16 days ago
    https://github.com/sdbds/flash-attention-for-windows/release... - FA2 binaries for you
    regularfry 15 days ago
    It flat didn't work for me on mps. CUDA only until someone patches it.
    magicalhippo 15 days ago
    Demo ran fine, if very slowly, with CPU-only using "--device cpu" for me. It defaults to CUDA though.
    Try using mps I guess, I saw multiple references to code checking if device is not mps, so seems like it should be supported. If not, CPU.
  - dsrtslnd23 16 days ago
    Any idea on the VRAM footprint for the 1.7B model? I guess it fits on consumer cards but I am wondering if it works on edge devices.
    magicalhippo 16 days ago
    The demo uses 6GB dedicated VRAM on Windows, but keep in mind that it's without FlashAttention. I expect it would drop a bit if I got that working.
    Haven't looked into the demo to see if it could be optimized by moving certain bits to CPU for example.
- pseudosavant 16 days ago
  Remarkable tech that is now accessible to almost anyone. My cloned voice sounded exactly like me. The uses for this will range from good to bad and everywhere in between. A deceased grandmother reading "Goodnight Moon" to grandkids, scamming people, the ability to create podcasts with your own voices from just prompts.
  - _kb 16 days ago
    It's a good thing governments (https://www.ato.gov.au/online-services/voice-authentication) and banks (https://www.anz.com.au/security/how-we-protect-you/voice-id/) haven't gone all in on using voice as an authentication mechanism.
- parentheses 15 days ago
  I got some errors trying to run this on my MBP. Claude was able to one-shot a fix.
```
Loaded speech tokenizer from ~/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-1.7B-VoiceDesign/snapshots/0e711a1c0aa5aad30654426e0d11f67716c1211e/speech_tokenizer
Fetching 11 files: 0%| | 0/11 [00:00<?, ?it/s]
Fetching 11 files: 100%|| 11/11 [00:00<00:00, 125033.45it/s]
The tokenizer you are loading from '!/.cache/huggingface/hub/models--Qwen--Qwen3-TTS-12Hz-1.7B-VoiceDesign/snapshots/0e711a1c0aa5aad30654426e0d11f67716c1211e' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instr.... This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
```
- cristoperb 16 days ago
  I cloned my voice and had it generate audio for a paragraph from something I wrote. It definitely kind of sounds like me, but I like it much better than listening to my real voice. Some kind of uncanny peak.
  - viraptor 16 days ago
    That weirdly makes it a canny peak though :)
  - bsenftner 15 days ago
    You do realize that you don't normally hear your real voice? An individual has to record their voice to hear how others hear it. What you hear when you speak includes your skull resonating, which others do not hear.
- mohsen1 16 days ago
  > The requested GPU duration (180s) is larger than the maximum allowed
  What am I doing wrong?
  - gregsadetsky 16 days ago
    you need to login
- KolmogorovComp 16 days ago
  Hello, the recording you posted does not tell much about the cloning capability without an example from your real voice.
  - simonw 16 days ago
    Given how easy voice cloning is with this thing I chickened out of sharing the training audio I recorded!
    That's not really rational considering the internet is full of examples of my voice that anyone could use though. Here's a recent podcast clip: https://www.youtube.com/watch?v=lVDhQMiAbR8&t=3006s
    KolmogorovComp 16 days ago
    Thanks, so it’s in the [pretty close but still distinguishable] range.
    genewitch 16 days ago
    it depends on the medium. A flac will be distinguishable (for now); but put it out over low bandwidth media and you get https://youtube.com/shorts/dpScfg3how8 which, https://www.youtube.com/shorts/zi7BVqVzRx4 is real good and close to what that podcaster's voice sounded like when i made that clone!
    i have several other examples from before my repeater ID voice clone. Newer voice models will have to wait till i recover my NAS tomorrow!
    this is the newest one i have access to: Dick Powell voice clone off his Richard Diamond Persona: https://soundcloud.com/djoutcold/dick-powell-voice-clone-tes...
    i was one-shotting voices years ago that were timbre/tonally identical to the reference voice; however the issue i had was inflection and subtlety. I find that female voices are much easier to clone, or at least it fools my brain into thinking so.
    this model, if the results weren't too cherry picked, will be huge improvement!
- kingstnap 16 days ago
  It was fun to try out. I wonder if at some point if I have a few minutes of me talking I could make myself read an entire book to myself.
- itsTyrion 15 days ago
  well that isnt concerning at all
simonw 16 days ago
I got this running on macOS using mlx-audio thanks to Prince Canuma: https://x.com/Prince_Canuma/status/2014453857019904423
Here's the script I'm using: https://github.com/simonw/tools/blob/main/python/q3_tts.py
You can try it with uv (downloads a 4.5GB model on first run) like this:
```
  uv run https://tools.simonwillison.net/python/q3_tts.py \
    'I am a pirate, give me your gold!' \
    -i 'gruff voice' -o pirate.wav
```
- genewitch 16 days ago
  If i am ever in the same city as you, i'll buy you dinner. I poked around during my free time today trying to figure out how to run these models, and here is the estimable Simon Willison just presenting it on a platter.
  hopefully i can make this work on windows (or linux, i guess).
  thanks so much.
  - cube00 15 days ago
    > hopefully i can make this work on windows (or linux, i guess).
    mlx-audio only works on Apple Silicon
    bigyabai 15 days ago
    The original script supports CPU inference, nonetheless.
- rahimnathwani 15 days ago
  If you want to do custom voice cloning, record a sample wav file with a sentence or two, and then try this:
  uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
  python -m mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 --text "Hello, this is a test." --ref_audio path_to_audio.wav --ref_text "Transcript of the reference audio." --play
- indigodaddy 16 days ago
  Simon, how do you think this would perform on CPU only? Let's say a Threadripper with 20GB of RAM. (Voice cloning in particular)
  - simonw 16 days ago
    No idea at all, but my guess is it would work but be a bit slow.
    You'd need to use a different build of the model though, I don't think MLX has a CPU implementation.
  - genewitch 16 days ago
    the old voice cloning and/or TTS models were CPU only, and they weren't realtime, but no worse than 2:1, 30 seconds of audio would take 60 seconds to generate. roughly. in 2021 one-shot TTS/cloning using GPUs was getting there, and that was close enough to realtime; one could, if one was willing to deal with it, wire microphone audio to the model, and speak words, and the model would, in real time, modify the voice. Phil Hendrie is jealous.
    anyhow, with faster CPUs and optimizations, you won't be waiting too long. Also 20GB is overkill for an audio model. Only text - LLM - are huge and take infinite memory. SD/FLUX models are under 16GB of ram usage (uh, mine are, at least!), for instance.
- gcr 16 days ago
  This is wonderful, thank you. Another win for uv!
TheAceOfHearts 16 days ago
Interesting model, I've managed to get the 0.6B param model running on my old 1080 and I can generate 200 character chunks safely without going OOM, so I thought that making an audiobook of the Tao Te Ching would be a good test. Unfortunately each snippet varies drastically in quality: sometimes the speaker is clear and coherent, but other times it bursts out laughing or moaning. In a way it feels a bit like magical roulette, never being quite certain of what you're going to get. It does have a bit of charm, when you chain the various snippets together you really don't know what direction it's gonna go.
Using speaker Ryan seems to be the most consistent, I tried speaker Eric and it sounded like someone putting on a fake exaggerated Chinese accent to mock speakers.
If it wasn't for the unpredictable level of emotions from each chunk, I'd say this is easily the highest quality TTS model I've tried.
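For anyone wanting to reproduce the 200-character chunking described above, here is a minimal sketch of one way to do it. This is not the commenter's actual harness; the function name and the choice to break on sentence punctuation first are assumptions. It packs whole sentences into chunks up to the limit and only hard-splits a sentence that is itself too long.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?;:])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single over-long sentence is hard-split, on a space where possible.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            piece, sentence = sentence[:cut], sentence[cut:].lstrip()
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece)
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the current chunk
        else:
            chunks.append(current)  # flush and start a new chunk
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "The Tao that can be told is not the eternal Tao. " * 20
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk))
```

Each chunk would then be sent to the TTS model separately and the resulting audio concatenated, which is also where the chunk-to-chunk variation in emotion described above would show up.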
- KaoruAoiShiho 16 days ago
  Have you tried specifying the emotion? There's an option to do so and if it's left empty it wouldn't surprise me if it defaulted to rng instead of bland.
  - TheAceOfHearts 16 days ago
    For the system prompt I used:
    > Read this in a calm, clear, and wise audiobook tone.
    > Do not rush. Allow the meaning to sink in.
    But maybe I should experiment with something more detailed. Do you have any suggestions?
    KaoruAoiShiho 16 days ago
    Something like this:
    Character Name: Marcus Cole
    Voice Profile: A bright, agile male voice with a natural upward lift, delivering lines at a brisk, energetic pace. Pitch leans high with spark, volume projects clearly, near-shouting at peaks, to convey urgency and excitement. Speech flows seamlessly, fluently, each word sharply defined, riding a current of dynamic rhythm.
    Background: Longtime broadcast booth announcer for national television, specializing in live interstitials and public engagement spots. His voice bridges segments, rallies action, and keeps momentum alive, from voter drives to entertainment news.
    Presence: Late 50s, neatly groomed, dressed in a crisp shirt under studio lights. Moves with practiced ease, eyes locked on the script, energy coiled and ready.
    Personality: Energetic, precise, inherently engaging. He doesn’t just read; he propels. Behind the speed is intent: to inform fast, to move people to act. Whether it’s “text VOTE to 5703” or a star-studded tease, he makes it feel immediate, vital.
- dsrtslnd2316 days ago
  do you have the RTF for the 1080? I am trying to figure out if the 0.6B model is viable for real-time inference on edge devices.
  - TheAceOfHearts16 days ago
    Yeah, it's not great. I wrote a harness that calculates it as: 3.61s Load Time, 38.78s Gen Time, 18.38s Audio Len, RTF 2.111.
    The Tao Te Ching audiobook came in at 62 mins in length and it ran for 102 mins, which gives an RTF of 1.645.
    I do get a warning about flash-attn not being installed, which says that it'll slow down inference. I'm not sure if that feature can be supported on the 1080 and I wasn't up for tinkering to try.
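    For reference, the RTF figures above are just generation time divided by audio length. A minimal sketch using the rounded numbers quoted in this comment (the helper name `real_time_factor` is made up for illustration, and load time is left out of the ratio):

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Single snippet: 38.78 s of generation for 18.38 s of audio.
single = real_time_factor(38.78, 18.38)

# Whole audiobook: 102 minutes of runtime for 62 minutes of audio.
audiobook = real_time_factor(102 * 60, 62 * 60)

print(f"snippet RTF:   {single:.2f}")
print(f"audiobook RTF: {audiobook:.3f}")
```

    An RTF above 1 means generation is slower than playback, so real-time streaming would need it comfortably below 1. The snippet figure comes out at roughly 2.11 from the rounded inputs, which is close to the 2.111 reported by the harness.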
    storystarling16 days ago
    An RTF above 1 for just 0.6B parameters suggests the bottleneck isn't the GPU, even on a 1080. The raw compute should be much faster. I'd bet it's mostly CPU overhead or an issue with the serving implementation.
    genewitch16 days ago
    you can install flash attention, et al, but if you're on windows, afaik, you can't use/run/install "triton kernels", which apparently make audio models scream. Whisper complains every time i start it, and it is pretty slow; so i just batch hundreds of audio files on a machine in the corner with a 3060 instead. technically i could batch them on a CPU, too, since i don't particularly care when they finish.
genewitch16 days ago
it isn't often that technology gives me chills, but this did it. I've used "AI" TTS tools since 2018 or so, and i thought the stuff from two years ago was about the best we were going to get. I don't know the size of these, i scrolled to the samples. I am going to get the models set up somewhere and test them out.
Now, maybe the results were cherrypicked. i know everyone else who has released one of these cherrypicks which to publish. However, this is the first time i've considered it plausible to use AI TTS to remaster old radioplays and the like, where a section of audio is unintelligible but can be deduced from context, like a tape glitch where someone says "HEY [...]LAR!" and it's an episode of Yours Truly, Johnny Dollar...
I have dozens of hours of audio of like Bob Bailey and people of that era.
- kamranjon16 days ago
  I wonder if it was trained on anime dubs cause all of the examples I listened to sounded very similar to a miyazaki style dub.
  - genewitch16 days ago
    scroll down to the second to last group, the second one down is obama speaking english, the third one down is trump speaking japanese (a translation of the english phrase)
    besides, they know what side their bread is buttered on. I feel like this is almost not the real announcement; or, the engineers that wrote this up and did the demos just ran it that way. The normal speech voices are fine (lower than the anime ones on the page.) i agree that the first few are very infantile. I'll change that word if i can think of a better one.
- freedomben16 days ago
  Indeed, I have a future project/goal of "restoring" Have Gun - Will Travel radio episodes to listenable quality using tech like this. There are so many lines where sound effects and tape rot and other "bad recording" things make it very difficult to understand what was said. Will be amazing, but as with all tech the potential for abuse is very real
  - genewitch16 days ago
    hey if you want to collab or trade notes, my email is in my profile. there was java software that did FANTASTIC work cleaning up crappy transfers of audio, like, specifically, it was perfect for "AM Quality Monaural Audio".
    Observe, original: https://www.youtube.com/watch?v=YiRcOVDAryM my edit (took about an hour, if memory serves, to set up. forgot render time...): https://www.youtube.com/watch?v=xazubVJ0jz4
    i say "was [...] software" because the last 2 times i've tried to use it, it did imperceptible cleanup, making it worthless. Anyhow, all my radio plays are from OTRR, i think.
    Audio.Restoration.DeNoise.DeNoiseLF.2.8.3_WiN.OSX is a more recent version i think
    p.s. are you a "dude named Ben"?
    freedomben15 days ago
    Neat! That's really cool. I'll definitely reach out once I'm ready to move forward on it. Got a few high-priority things sucking up all my free time at the moment :-(
    Yeah all my radio plays are from OTRR now. I bought a number of different "collections" from different sources but none of them come even close to the quality and care that the OTRR people have.
    Also, always a pleasure to meet someone else who loves old-time radio :-D
    What are some of your favorites? Probably my favorite is Abbott & Costello, followed by Have Gun - Will Travel and Gunsmoke. I like the Lone Ranger too but am only a few hours into it so far.
    p.s. I am indeed a dude named Ben!
    genewitch13 days ago
    As far as old time radio: Yours Truly, Johnny Dollar; Richard Diamond (and Rogue's Gallery, same star and crew), and Philip Marlowe. I try to get back in to the ones i listened to as a kid 40 odd years ago like Dragnet and Broadway is My Beat, but there's something too slow about them for me now. I like Suspense, as well. I'd have to go digging for some other shows i've enjoyed, as the YT,JD set is hundreds and hundreds of episodes long and keeps me entertained on long car trips...
    I also have some newer things, i'm trying to fill my "Coast-to-Coast AM" collection, i've started on Phil Hendrie, and i have most of a show called "Love Line" hosted by Dr Drew Pinsky (and others, usually Adam Carolla). That one i asked permission to clone an archive i found on accident, and the archive owner/host was glad i was doing it. Those are all transcribed, now.
    Each is kind of a snapshot of the time they existed, and i don't necessarily want to listen to them all (i don't like Art Bell that much, or talk radio in general!)
    And if the reference to "Dude Named Ben" is correct, ITM, and i hope to hear from you when we both have time to fix these old shows!
throwaw1216 days ago
Qwen team, please please please, release something to outperform and surpass the coding abilities of Opus 4.5.
Although I like the model, I don't like the leadership of that company and how closed it is, how divisive they're in terms of politics.
- mortsnort16 days ago
  They were just waiting for someone in the comments to ask!
  - zeppelin10116 days ago
    Someone has to take the first step. Let's be grateful to the brave anon HN poster for stepping up.
  - mhuffman16 days ago
    It really is the best way to incentivize politeness!
  - stuckkeys16 days ago
    I loled hard at this. Thank you kind stranger.
- pseudony16 days ago
  Same issue (I am Danish).
  Have you tested alternatives? I grabbed Open Code and a Minimax m2.1 subscription, even just the 10usd/mo one to test with.
  Result? We designed a spec for a slight variation of a tool for which I wrote a spec with Claude - same problem (process supervisor tool), from scratch.
  Honestly, it worked great, I have played a little further with generating code (this time golang), again, I am happy.
  Beyond that, Glm4.7 should also be great.
  See https://dev.to/kilocode/open-weight-models-are-getting-serio...
  It is a recent case story of vibing a smaller tool with kilo code, comparing output from minimax m2.1 and Glm4.7
  Honestly, just give it a whirl - no need to send money to companies/nations you disagree with.
  - nunodonato16 days ago
    I've been using GLM 4.7 with Claude Code. best of both worlds. Canceled my Anthropic subscription due to the US politics as well. Already started my "withdrawal" in Jan 2025, Anthropic was one of the few that was left
    bigyabai16 days ago
    I'm in the same boat. Sonnet was overkill for me, and GLM is cheap and smart enough to spit out boilerplate and FFMPEG commands whenever it's asked.
    $20/month is a bit of an insane ask when the most valuable thing Anthropic makes is the free Claude Code CLI.
    mikenew16 days ago
    I've recently switched to OpenCode and found it to be far better. Plus GLM 4.7 is free at the moment, so for now it's a great no-cost setup.
    stavros16 days ago
    I don't know, I max out my Opus limits regularly. I guess it depends on usage.
    Mashimo15 days ago
    > I'm in the same boat. Sonnet was overkill for me, and GLM is cheap and smart enough to spit out boilerplate and FFMPEG commands whenever it's asked.
    Do you even need a subscription to any service for that? Is a free tier not enough?
    dsrtslnd2316 days ago
    Are you using an API proxy to route GLM into the Claude Code CLI? Or do you mean side-by-side usage? Not sure if custom endpoints are supported natively yet.
    sumedh15 days ago
    This works:
    ZAI_ANTHROPIC_BASE_URL=xxx
    ZAI_ANTHROPIC_AUTH_TOKEN=xxx
    alias "claude-zai"="ANTHROPIC_BASE_URL=$ZAI_ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN=$ZAI_ANTHROPIC_AUTH_TOKEN claude"
    Then you can run `claude`, hit your limit, exit the session and `claude-zai -c` to continue (with context reset, of course).
    Someone gave me that command a while back.
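    If you'd rather not rely on a shell alias, the same idea can be sketched in Python: copy the current environment, point `ANTHROPIC_BASE_URL` and `ANTHROPIC_AUTH_TOKEN` at the z.ai values, and launch `claude` with it. The helper names `zai_env` and `launch_claude_zai` are made up for illustration, and the sketch assumes the two `ZAI_*` variables are already set and that `claude` is on your PATH.

```python
import os
import subprocess


def zai_env(base_env=None):
    """Return a copy of the environment with the Anthropic variables overridden."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: both ZAI_* variables were set beforehand, as in the alias.
    env["ANTHROPIC_BASE_URL"] = env["ZAI_ANTHROPIC_BASE_URL"]
    env["ANTHROPIC_AUTH_TOKEN"] = env["ZAI_ANTHROPIC_AUTH_TOKEN"]
    return env


def launch_claude_zai(extra_args=()):
    """Run the claude CLI with the overridden environment, e.g. extra_args=["-c"]."""
    return subprocess.run(["claude", *extra_args], env=zai_env(), check=False)
```

    Calling `launch_claude_zai(["-c"])` would then play the role of `claude-zai -c`. Only the child process sees the overrides, so a plain `claude` in the same shell keeps using the normal endpoint.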
    nunodonato14 days ago
    thats pretty much what I do, I have a bash alias to launch either the normal claude code, or the glm one
    Mashimo15 days ago
    This is the official guide: https://docs.z.ai/devpack/tool/claude
    stavros16 days ago
    I much prefer OpenCode these days, give it a try.
    nunodonato16 days ago
    I did, I couldnt get used to it and didn't get so good results. I think Claude Code's tools are really top notch, and maybe the system prompt
- TylerLives16 days ago
  >how divisive they're in terms of politics
  What do you mean by this?
  - throwaw1216 days ago
    Dario said not nice words about China and open models in general:
    https://www.bloomberg.com/news/articles/2026-01-20/anthropic...
    vlovich12316 days ago
    I think the least politically divisive issue within the US is concern about China’s growth as it directly threatens the US’s ability to set the world’s agenda. It may be politically divisive if you are aligned with Chinese interests but I don’t see anything politically divisive for a US audience. I expect Chinese CEOs speak in similar terms to a Chinese audience in terms of making sure they’re decoupled from the now unstable US political machine.
    cmrdporcupine16 days ago
    "... for a US audience"
    And that's the rub.
    Many of us are not.
    subscribed16 days ago
    Looking at the last year's US agenda I'm okay with that.
    Levitz16 days ago
    I mean, there's no way it's about this right?
    Being critical of favorable actions towards a rival country shouldn't be divisive, and if it is, well, I don't think the problem is in the criticism.
    Also the link doesn't mention open source? From a google search, he doesn't seem to care much for it.
    giancarlostoro16 days ago
    From the perspective of competing against China in terms of AI the argument against open models makes sense to me. It’s a terrible problem to have really. Ideally we should all be able to work together in the sandbox towards a better tomorrow but thats not reality.
    I prefer to have more open models. On the other hand China closes up their open models once they start to show a competitive edge.
  - Balinares16 days ago
    They're supporters of the Trump administration's military, a view which is not universally lauded.
- mohsen116 days ago
  With a good harness I am getting similar results with GLM 4.7. I am paying for TWO! max accounts and my agents are running 24/7.
  I still have a small Claude account to do some code reviews. Opus 4.5 does good reviews but at this point GLM 4.7 usually can do the same code reviews.
  If cost is an issue (for me it is, I pay out of pocket) go with GLM 4.7
  - imiric16 days ago
    Your GitHub profile is... disturbing. 1,354 commits and 464 pull requests in January so far.
    Regardless of how productive those numbers may seem, that amount of code being published so quickly is concerning, to say the least. It couldn't have possibly been reviewed by a human or properly tested.
    If this is the future of software development, society is cooked.
    mohsen116 days ago
    It's mostly trying out my orchestration system (https://github.com/mohsen1/claude-code-orchestrator and https://github.com/mohsen1/claude-orchestrator-action) in a repo using GH_PAT.
    Stuff like this: https://github.com/mohsen1/claude-code-orchestrator-e2e-test...
    Yes, the idea is to really, fully automate software engineering. I don't know if I am going to be successful but I'm on vacation and having fun!
    If Opus 4.5/GLM 4.7 can do so much already, I can only imagine what can be done in two years. Might as well adapt to this reality and learn how to leverage this advancement
    azuanrb15 days ago
    On the contrary, that actually is pretty cool. z.ai subscription is cheap enough that I'm thinking to run it 24/7 too. Curious if you've tried any other AI orchestration tools like Gas Town? What made you decide to build your own, and how is it working for you so far?
    mohsen115 days ago
    I didn't know about Gas Town! Super cool! I will try it once I have a chance. I started with a few dumb Tmux based scripts and eventually I figured I'd make it into a proper package.
    I think using GitHub with issues, PRs, and especially leveraging AI code reviewers like Greptile is actually the way to go. I made an attempt here https://github.com/mohsen1/claude-orchestrator-action but I think it needs a lot more attention to get it right. Ideas in Gas Town are great and I might steal some of those. Running Claude Code in a GitHub Action works great with GLM 4.7.
    Microsoft's new Agent SDK is also interesting. Unlocks multi-provider workflows so users can burn out all of their subscriptions or quickly switch providers
    Also super interested in collaborating with someone to build something together if you are interested!
    gcr16 days ago
    You may not like it but this is what a 10x developer looks like. :-)
    genewitch16 days ago
    you may enjoy spaghetti, but will you enjoy 10x spaghetti?
- amrrs16 days ago
  Have you tried the new GLM 4.7?
  - davely16 days ago
    I've been using GLM 4.7 alongside Opus 4.5 and I can't believe how bad it is. Seriously.
    I spent 20 minutes yesterday trying to get GLM 4.7 to understand that a simple modal on a web page (vanilla JS and HTML!) wasn't displaying when a certain button was clicked. I hooked it up to Chrome MCP in Open Code as well.
    It constantly told me that it fixed the problem. In frustration, I opened Claude Code and just typed "Why won't the button with ID 'edit' work???!"
    It fixed the problem in one shot. This isn't even a hard problem (and I could have just fixed it myself but I guess sunk cost fallacy).
    bityard16 days ago
    I've used a bunch of the SOTA models (via my work's Windsurf subscription) for HTML/CSS/JS stuff over the past few months. Mind you, I am not a web developer, these are just internal and personal projects.
    My experience is that all of the models seem to do a decent job of writing a whole application from scratch, up to a certain point of complexity. But as soon as you ask them for non-trivial modifications and bugfixes, they _usually_ go deep into rationalized rabbit holes into nowhere.
    I burned through a lot of credits to try them all and Gemini tended to work the best for the things I was doing. But as always, YMMV.
    KolmogorovComp16 days ago
    Exactly the same feedback
    Balinares16 days ago
    Amazingly, just yesterday, I had Opus 4.5 crap itself extensively on a fairly simple problem -- it was trying to override a column with an aggregation function while also using it in a group-by without referring to the original column by its full qualified name prefixed with the table -- and in typical Claude fashion it assembled an entire abstraction layer to try and hide the problem under, before finally giving up, deleting the column, and smugly informing me I didn't need it anyway.
    That evening, for kicks, I brought the problem to GLM 4.7 Flash (Flash!) and it one-shot the right solution.
    It's not apples to apples, because when it comes down to it LLMs are statistical token extruders, and it's a lot easier to extrude the likely tokens from an isolated query than from a whole workspace that's already been messed up somewhat by said LLM. That, and data is not the plural of anecdote. But still, I'm easily amused, and this amused me. (I haven't otherwise pushed GLM 4.7 much and I don't have a strong opinion about it.)
    But seriously, given the consistent pattern of knitting ever larger carpets to sweep errors under that Claude seems to exhibit over and over instead of identifying and addressing root causes, I'm curious what the codebases of people who use it a lot look like.
    girvo16 days ago
    > I can't believe how bad it is
    This has been my consistent experience with every model prior to Opus 4.5, and every single open model I've given a go.
    Hopefully we will get there in another 6 months when Opus is distilled into new open models, but I've always been shocked at some of the claims around open models, when I've been entirely unable to replicate them.
    Hell, even Opus 4.5 shits the bed with semi-regularity on anything that's not completely greenfield for my usage, once I'm giving it tasks beyond some unseen complexity boundary.
  - throwaw1216 days ago
    yes I did, not on par with Opus 4.5.
    I use Opus 4.5 for planning; when I reach my usage limits I fall back to GLM 4.7 only for implementing the plan. It still struggles, even though I configure GLM 4.7 as both the smaller model and the heavier model in Claude Code
- WarmWash16 days ago
  The Chinese labs distill the SOTA models to boost the performance of theirs. They are a trailer hooked up (with a 3-6 month long chain) to the trucks pushing the technology forwards. I've yet to see a trailer overtake its truck.
  China would need an architectural breakthrough to leap American labs given the huge compute disparity.
  - miklosz16 days ago
    I have seen indeed a trailer overtake its truck. Not a beautiful view.
    digdugdirk16 days ago
    Agreed. I do think the metaphor still holds though.
    A financial jackknifing of the AI industry seems to be one very plausible outcome as these promises/expectations of the AI companies start meeting reality.
  - overfeed16 days ago
    Care to explain how the volume of AI research papers authored by Chinese researchers[1] has exceeded US-published ones? Time-traveling plagiarism perhaps, since you believe the US is destined to lead always.
    1. Chinese researcher in China, to be more specific.
    bfeynman16 days ago
    Not a great metric, research in academia doesn't necessarily translate to value. In the US they've poached so many academics because of how much value they directly translate to.
    WarmWash16 days ago
    I don't doubt China would be capable of making SOTA models; however, they are very heavily compute constrained. So they are forced to shortcut compute by riding the coattails of compute heavy models.
    They need a training-multiplier breakthrough that would allow them to train SOTA models on a fraction of the compute that the US does. And this would also have to be kept a secret and be well hidden (often multiple researchers from around the world put the pieces together on a problem at around the same time, so the breakthrough would have to be something pretty difficult to discover for the greatest minds in the field) to prevent the US from using it to multiply their model strength with their greater compute.
    jacquesm16 days ago
    Volume is easy: they have far more people, it is quality that counts.
    numpad016 days ago
    Yeah, and if anything it's the US defying a massive disadvantage in headcount that is odd, not the other way around.
    1: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population
    overfeed16 days ago
    Perhaps you should pay attention to where the puck is going to be, rather than where it is currently. Lots of original ideas are coming out of Chinese AI research[1], denying this betrays some level of cope.
    1. e.g. select any DeepSeek release, and read the accompanying paper
    jacquesm16 days ago
    I'll pay attention to where the puck is because that is something I can observe, where it is going to be is anybody's guess. Lots of original ideas are coming out of Chinese AI research but there is also lots of junk. I think in the longer term they will have the advantage but right now that simply isn't the case.
    Your 'cope' accusation has no place here, I have no dog in the race and do not need to cope with anything.
    overfeed16 days ago
    > Your 'cope' accusation has no place here
    I will rephrase my statement and continue to stand by it: "Denying the volume of original AI research being done by China - a falsifiable metric - betrays some level of cope."
    You seem to agree on the fact that China has surpassed the US. As for quality, I'll say expertise is a result of execution. At some point in time during off-shoring, the US had qualitatively better machinists than China, despite manufacturing volumes. That is no longer the case today - as they say, cream floats to the top, and that holds true for a pot or an industrial-sized vat.
    popalchemist16 days ago
    It may not be cope, could just be ignorance.
    sieabahlpark16 days ago
    [dead]
  - aaa_aaa16 days ago
    No, all they need is time. I am awaiting the downfall of the ai hegemony and hype with popcorn at hand.
  - mhuffman16 days ago
    I would be happy with an openweight 3 month old Claude
    cmrdporcupine16 days ago
    DeepSeek 3.2 is frankly fairly close to that. GLM 4.7 as well. They're basically around Sonnet 4 level.
  - genewitch16 days ago
    can you point me at another free voice cloning / tts model with this fidelity and, i guess prompt adherence?
    because i've been on youtube and insta, and believe me, no one else even compares, yet.
- Onavo16 days ago
  Well DeepSeek V4 is rumored to be in that range and will be released in 3 weeks.
- aussieguy123416 days ago
  I could say the same about grok (although given there are better models for my use cases I don't use it). What part of divisive politics are you talking about here?
- sampton16 days ago
  Every time Dario opens his mouth it's something weird.
chriswep15 days ago
In my tests this doesn't come close to the years old coqui/XTTS-v2. It has great voice cloning capabilities and creates rich speech with emotions with low latency. I tried out several local-TTS projects over the years but i'm somewhat confused that nothing seems to be able to match coqui despite the leaps that we see in other areas of AI. Can somebody with more knowledge in this field explain why that might be? Or am i completely missing something?
girvo16 days ago
Amusingly one of their examples (the final Age Control example) is prompted to have American English as an accent, but sounds like an Australian trying to sound American to my ear haha
rahimnathwani16 days ago
Has anyone successfully run this on a Mac? The installation instructions appear to assume an NVIDIA GPU (CUDA, FlashAttention), and I’m not sure whether it works with PyTorch’s Metal/MPS backend.
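As a starting point for experimenting, here is a minimal sketch of picking a PyTorch device at runtime instead of hard-coding CUDA. It only shows how to detect the available backend; it does not demonstrate that Qwen3-TTS itself runs on MPS, and the helper name `pick_device` is made up for illustration.

```python
def pick_device():
    """Return a device string for PyTorch: CUDA first, then Apple MPS, then CPU."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch installed, only CPU-style fallbacks apply.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda:0"
    # torch.backends.mps is absent on older PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

Whether the returned string can simply be passed wherever the upstream examples use a CUDA device is untested here, and any code path that requires FlashAttention would still need an alternative on a Mac.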
- magicalhippo16 days ago
  FWIW you can run the demo without FlashAttention using --no-flash-attn command-line parameter, I do that since I'm on Windows and haven't gotten FlashAttention2 to work.
- turnsout16 days ago
  It seems to depend on FlashAttention, so the short answer is no. Hopefully someone does the work of porting the inference code over!
- Lichtso15 days ago
  Yes, using mlx-audio. See https://news.ycombinator.com/item?id=46726440
  - rahimnathwani15 days ago
    Thanks! Simon's example uses the custom voice model (creating a voice from instructions). But that comment led me eventually to this page, which shows how to use mlx-audio for custom voices:
    https://huggingface.co/mlx-community/Qwen3-TTS-12Hz-0.6B-Bas...
    uv tool install --force git+https://github.com/Blaizzy/mlx-audio.git --prerelease=allow
    python -m mlx_audio.tts.generate --model mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 --text "Hello, this is a test." --ref_audio path_to_audio.wav --ref_text "Transcript of the reference audio." --play
- javier12345432116 days ago
  I recommend using modal for renting the metal.
PunchyHamster16 days ago
Looking forward to my grandma being scammed by one!
- jacquesm16 days ago
  So far that seems to be the main use case.
- bigyabai16 days ago
  Grandmas should know better, nowadays. It's 2026, half of today's grandparents grew up with QVC and landline psychics.
viraptor15 days ago
I can't quite figure this out: Can you save a generated voice for reuse later? The mlx-audio I looked at seems to take the text itself in every interface and doesn't expose it as a separate object. (I can dive deeper, but wanted to check if anyone's done it already)
- akadeb15 days ago
  You could pipe the output to an audio file with ffmpeg or pyaudio and save it locally
  - viraptor15 days ago
    I don't want to save the audio. I want to save the voice model so I can use it for many different texts, for consistency.
    stuckkeys15 days ago
    Yes, you can. I was just testing it. I made a "My Custom Voices" tab, and recorded a small sample of my own voice (you can also upload a sample of w/e voice). Then you can use it. I am in the process of training a model of my voice too to see how it handles it using the 1.7b
    Works surprisingly well with a 4090. I will also try it on 5090. This is the best one I have seen so far. NGL. 11Labs is cooked lol.
d4rkp4ttern15 days ago
Curious how it compares to last week’s release of Kyutai’s Pocket-TTS [1] which is just 100M params, and excellent in both speed and quality (English only). I use it in my voice plugin [2] for quick voice updates in Claude Code.
[1] https://github.com/kyutai-labs/pocket-tts
[2] https://github.com/pchalasani/claude-code-tools?tab=readme-o...
satvikpendem16 days ago
This would be great for audiobooks, some of the current AI TTS still struggle.
anotherevan16 days ago
Is there any way to take a cloned voice model and plug into Android TTS and/or Windows?
I have a friend with a paralysed larynx who is often using his phone or a small laptop to type in order to communicate. I know he would love it if it was possible to take old recordings of him speaking and use that to give him back "his" voice, at least in some small measure.
thedangler16 days ago
Kind of a noob, how would I implement this locally? How do I pass it audio to process? I'm assuming it's in the API spec?
- dust4216 days ago
  Scroll down on the Huggingface page, there are code examples and also a link to github: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base
- daliusd16 days ago
  I wanted to try this locally as well so I have asked AI to write CLI for me: https://github.com/daliusd/qtts
  There are some samples. If you have a GPU you might want to fork and improve this; otherwise it's slow, but usable on CPU as well.
7777777phil15 days ago
Here is a Colab Notebook where you can test it on any of the available GPUs (H100, A100, T4): https://colab.research.google.com/drive/1szmNh25TmMpPd4aKjWX...
indigodaddy16 days ago
How does the cloning compare to pocket TTS?
- andhuman16 days ago
  It’s uncannily good. I prefer it to pocket, but then again pocket is much smaller and for realtime streaming.
  - indigodaddy16 days ago
    Ah right I guess I meant for instant which I assume qwen can't do
- quinncom16 days ago
  Pocket TTS is much smaller: 100M parameters versus 600–1800M.
  - indigodaddy16 days ago
    Ah right so I guess qwen3-tts isn't going to work for cpu-only like pocket TTS can(?)
    magicalhippo15 days ago
    The current code doesn't appear very optimized. Running on CPU-only it only uses four threads for example, nowhere close to saturating all my cores.
    As a result it's dog slow on CPU only, like 3-4 minutes to produce a 3 second clip, and still significantly less than real-time on my 5090 using only 30% of the GPU.
gunalx16 days ago
Voice actors are slo cooked. Some of the demos arguably sounded way better than a lot of indie voice-acting.
whinvik16 days ago
Haha something that I want to try out. I have started using voice input more and more instead of typing and now I am on my second app and second TTS model, namely Handy and Parakeet V3.
Parakeet is pretty good, but there are times it struggles. Would be interesting to see how Qwen compares once Handy has it in.
- Footprint052116 days ago
  Why parakeet over whisper v3 turbo? Just curious as one who heavily uses whisper, I’ve seemed to have better results with that
  - whinvik16 days ago
    Parakeet is much smaller and for me the perf/speed combo has just been better.
- woodson16 days ago
  This is about text to speech, not speech recognition.
lostmsu16 days ago
I still don't know anyone who managed to get Qwen3-Omni to work properly on a local machine.
JonChesterfield16 days ago
I see a lot of references to `device_map="cuda:0"` but no cuda in the github repo, is the complete stack flash attention plus this python plus the weights file, or does one need vLLM running as well?
khimaros15 days ago
i made an epub to audiobook generator using this with optional LLM integration for dramatized output: https://github.com/khimaros/autiobook -- also submitted here: https://news.ycombinator.com/item?id=46737968
naveen-zerocool15 days ago
I just created a video trying it out - https://youtu.be/0LU9nmnR0cs
albertwang16 days ago
great news, this looks great! is it just me, or do most of the english audio samples sound like anime voices?
- numpad016 days ago
  I suspect they might be using voice lines from Chinese gacha games in addition to what clearly sound like VTubers, YouTubers, and Chinese TV documentary narrations. Those games all come with clean monaural CN/JP/EN files consistent in contents across languages for all regions, for an obvious[1] reason.
  1: https://old.reddit.com/r/ZenlessZoneZero/comments/1gqmtl1/th...
- rapind16 days ago
  > do most of the english audio samples sound like anime voices?
  100% I was thinking the same thing.
- bityard16 days ago
  Well, if you look at the prompts, they are basically told to sound like that.
  And if you ask me, I think these models were trained on tween fiction podcasts. (My kids listen to a lot of these and dramatic over-acting seems to be the industry standard.)
  Also, their middle-aged adult with an "American English" accent doesn't sound like any American I've ever met. More like a bad Sean Connery impersonator.
- reactordev16 days ago
  The real value I see is being able to clone a voice and change timbre and characteristics of the voice to be able to quickly generate voice overs, narrations, voice acting, etc. It's superb!
- devttyeu16 days ago
  Also like some popular youtubers and popular speakers.
  - pixl9716 days ago
    Hmm, wonder where they got their training data from?
- thehamkercat16 days ago
  even the Japanese audio samples sound like anime
- htrp16 days ago
  subbed audio training data (much better than cc data) is better
sails16 days ago
Any recommendations for an iOS app to test models like this? There are a few good ones for text gen, and it’s a great way to try models
- bigyabai16 days ago
  Besides UTM, no.
swaraj16 days ago
Tried the voice clone with a 30s trump clip (with reference text), and it didn't sound like him at all.
jakobdabo16 days ago
Can anyone please provide directions/links to tools that can be run locally, and that take an audio recording of a voice as an input, and produce an output with the same voice saying the same thing with the same intonations, but with a fixed/changed accent?
This is needed for processing an indie game's voice recordings, where the voice actors weren't native speakers and had some accent.
sinnickal15 days ago
Prepare for an influx of sensational hot-mic clips allegedly from high profile people
dangoodmanUT16 days ago
Many voices clone better than 11labs, while admittedly lower bitrate
ideashower16 days ago
Huh. One of the English Voice Clone examples features Obama.
- subscribed16 days ago
  Distinct, characteristic voice. My first to play with will be Morgan Freeman.
- illwrks16 days ago
  I think the other sounds like Steve Jobs - I could be wrong though!
jonkoops15 days ago
Honestly, this seems like it could be pretty cool for video games. I always liked Oblivion's 'Radiant AI', this could be a natural progression, give characters motivations, relations with the player and other NPCs and have an LLM spit out background dialogue, then have another model generate the audio.
wahnfrieden16 days ago
How is it for Japanese?
- numpad016 days ago
  The demo page only has three samples for Japanese, and one of them pronounces taskete as itsukete(???), so...
  - wahnfrieden16 days ago
    Thanks. All modern TTS for Japanese are useless failures.
    numpad016 days ago
    Just use that green guy, classical TTS ain't broke nanoda.
    wahnfrieden15 days ago
    I’m building Manabi Reader and would need to contact them for an enterprise contract. Annoying licensing but maybe inevitable
- salzig16 days ago
  there is a sample clone -> Trump speaks Japanese.
  Edit: "Cross-lingual Voice Clone" https://qwen.ai/blog?id=qwen3tts-0115#voice-clone
salzig16 days ago
So now we're getting every movie in "original voice" but local language? Can't wait to view anime or Bollywood :D