On the other hand, some of my favorite audiobooks stood out because the narrator was interpreting the text really well, for example by changing the pacing during chaotic moments. Or those audiobooks with multiple narrators and different voices for each character. Not to mention that sometimes the only cue you get for who's speaking during dialogue is how the voice actor changes their tone. I have mixed feelings about using this and losing some of that quality.
I would totally use this over amateur ebooks or public domain audiobooks like the ones on Project Gutenberg. As cool as it is/was for someone to contribute to free books... as a listener it was always jarring to switch to a new chapter and hear a completely different voice and microphone quality for no reason.
This (and everything else with AI) isn't saying "you don't need good actors any more". It's saying "if you don't have an audiobook, you can make a mediocre one automatically".
AI (text, images, videos, whatever) doesn't replace the top end, it replaces the entire bottom-to-middle end.
An embalming tech for our dying civilization.
Printing presses produce superior products.
A mediocre audiobook is certainly better than no audiobook at all, but it is an inferior product to a well produced audiobook.
That seems like a highly dubious statement. Many hand-illuminated manuscripts are masterpieces of art. The advantage of the printing press was chiefly economic, making the cost of a copy dramatically lower, not an increase in quality (especially by the aesthetic standards of the time).
I love audiobooks but at this point, most of what I want to listen to is stuff that would not sell enough to bother having someone read.
There are also many voice actors whose reading style I simply don't like.
A future where I can pick a voice that I like for any PDF is a huge upgrade.
I think a problem people have, if they're on the young side, is that maybe they didn't expect the future to change like this.
No one I knew went on the internet when I graduated high school. Change like this is all par for the course. The only advice I got in high school from a guidance counselor was that I had a nice voice for radio. Books on tape was not exactly a career option at the time. The culture will survive the death of a career path that didn't even really exist when I was a senior in high school.
Even current SOTA models would almost certainly be able to handle multiple speakers and pick up on the intended tone and intonation.
Don't make the mistake of thinking what we have today is what we will still be working with in 5 or 10 years.
There will be curation and specialization. Previously ignored niches now will be economically profitable. It will be a Renaissance of creativity, and millions of jobs will be created.
There's a lot more to be said for the value of audio books, but the accessibility gains of proliferated auto-generated audiobooks outweigh the downside of losing a small number of expertly produced audio books.
For context, I listen to audio books a lot, and for years I have listened to traditional TTS readings of books too. Better voice generation for books without audiobooks is a great win for society.
Death of a civilization doesn't mean disappearance of mankind or even overall regression in the long term.
(To be clear, nothing is solely and exclusively caused by any one thing. Causality is a very fuzzy concept. But sans printing press, those wars certainly wouldn’t have happened when/where/how they did, if they ever happened at all).
https://ehne.fr/en/encyclopedia/themes/european-humanism/eur...
Without trains, the logistics of canned food isn't much better than the logistics of any bread-based food you give to your soldiers. It doesn't solve the weight problem, which was the key problem in preindustrial army logistics.
AI is big and significant, but we'll be ok. There is also no such "one" thing as "our civilisation". We're extremely vast and complex, deeply interconnected networks of ever-changing relationships.
AI does indeed represent the commoditisation of things we used to really value like "craftsmanship in book narration" and "intelligence". But we've had commoditisations of similar media in the past.
Paper used to be extremely expensive, but as time went on, it became more and more commoditised.
Memory used to be extremely expensive (2000-3000 years ago, we needed to encode memory in _dance_, _stories_ and _plays_. Holy shit). Now you can purchase enough memory to store a billion books for maybe two hours of labor.
Most of these things don't really matter. What is happening is that the media landscape is significantly shifting, and that is a tale as old as history.
I do think the intellectual class will be affected the most. People who understand this shift stand to benefit enormously, while those who don't _might_ end up in a super awful super low class.
And yet, all of that doesn't really matter if you just move to, I dunno, Paramaribo or whatever. The people there are pragmatic and friendly. They don't care about AI too much. Or maybe New Zealand, or Iceland, or Peru, or Nepal or I don't know.
The world isn't ending. Civilisation isn't being destroyed at our core.
The media landscape is changing, classes are shifting, power-relationships are changing. I suggest you think deeply about where you want to live, what you stand for and what is most important to you in life.
I don't need money or tech to be happy. I am fine with just my cats, my closest friends and family and healthy food.
If it happens to be the case that I need to leave tech or that extremely high-end narrated audiobooks cease to exist? Then all I have to say is "oh no, anyway".
We'll be fine. One way or another.
Just different.
This stance always reminds me of "Profession", a 1957 novella by Isaac Asimov that depicts pretty much a future where there are only top performers and the ignorant crowd.
The "top-enders" are the privileged who need to have some of their gains for their intelligence redistributed to others. The alternative is "survival of the smartest", which is de-facto what we have today and what Young was trying to warn us about.
IMGO (gut opinion), generative AI is a consumption aid, like a strong antacid. It lets us be done with $content quicker, for content = {book, art, noisy_email, coding_task}. There are obvious preconceptions forming among us all from the "generative" nomenclature, but a lot of the usages that survive are reductive in genuinely useful ways.
Even on the non-fiction side, the narration for Gleick's The Information adds something.
While I want this tool for all the stuff with no narration, NYT/New Yorker/etc replacing human narrators with AI ones has been so shitty. The human narrators sound good, not just average. They add something. The AI narrators are simply bad.
New authors and self-publishers can't afford tens of thousands of dollars to get an audiobook recorded professionally... This can limit their distribution.
Authors might even choose not to make such a version (or lack the confidence to record it themselves), so AI capable of making a decently passable version would be nice -- something more than reading text blandly. AI could in theory attempt to track the scene and adjust.
I wonder if a standardized markup exists to do so.
With LLMs proving to be very good at generating code, it may be reasonable to assume they can get good at generating SSML as well.
Not sure if there is a more direct way to channel the interpretation of the tone/context/emotion etc from prose into generated voice qualities.
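If it helps to picture it, here's a rough sketch of that idea. The llm.complete() call is a hypothetical stand-in for whatever model is used; the SSML tags themselves are standard, though how faithfully a given TTS engine honors them varies.

    # Rough sketch: ask an LLM to rewrite prose as SSML, then hand the result
    # to whichever SSML-aware TTS engine you use. llm.complete() is a
    # hypothetical placeholder; the tags are standard SSML.
    PROMPT = (
        "Rewrite the following passage as SSML. Use <prosody rate='...'> for "
        "pacing, <emphasis> for stressed words, and <break time='...'/> "
        "between beats of dialogue. Output only the SSML.\n\n"
    )

    def passage_to_ssml(llm, passage):
        ssml = llm.complete(PROMPT + passage)   # hypothetical LLM call
        if not ssml.strip().startswith("<speak>"):
            raise ValueError("model did not return SSML")
        return ssml

    # The kind of output you'd hope for:
    # <speak>
    #   <prosody rate="fast">The door burst open.</prosody>
    #   <break time="400ms"/>
    #   <emphasis level="strong">"Run!"</emphasis> she shouted.
    # </speak>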
If we train some models on ebooks along with their professionally produced human-narrated audiobooks, with enough variety and volume of training data, the models might capture the essence of that human-interpretation of written text? Just maybe?
Amazon with its huge collection of Audible + Kindle library -- if it can do this without violating any rights -- has a huge corpus for this. They already have "whispersync" which is a feature that syncs text in a kindle ebook with words in corresponding audible audiobook.
Probably the results with a model trained for this plus human audit could lead to very good results.
TortoiseTTS has a few examples under prompt engineering on their demo site: https://nonint.com/static/tortoise_v2_examples.html
But the difference compared to good audiobooks is that you have:
* different voices for the narrator and each character
* different emotions and/or speed in certain situations
I guess you could use an LLM to "understand" and annotate an existing book if there's a markup, and then use TTS to create an audiobook from it, and so automate most of the process.
"Annotate the following text with speakers and emotions so that it can be turned into an audiobook via TTS", followed by a short text from "The Hobbit" (The "Good morning scene"). The result is very good.
Oh, and it's also a boon for those who can't afford to buy audiobooks.
They are also different activities: with audio it's easier to listen to more, but retention is usually lower. Not casting any elitist "you need to read" bullshit by the way, but I find it odd to define it in terms of lack of time, and I really like both mediums.
there are other factors as well. i love reading so much that i tend to forget time around me. as a result reading would cause me to neglect other duties. i can't allow that, and therefore i am forced to avoid reading. i also don't like long form reading on electronic devices, and as a frequent traveler, printed books are simply not practical and often not even accessible.
i agree with the retention issue, but i found that a much larger factor for retention is how well i can follow the story. a good story that is easy to get into is also easier to retain. and finally, reading fiction is for entertainment. i don't have to retain it.
There are a few categories where it makes sense to roll your eyes, like if they say they have no time to shower or have never been to one of their kid's baseball games.
But for things that aren't basic human expectations, I think you'd have to be a real jerk to roll your eyes at someone not having time. No time to cook multi-pot dishes? No time to exercise? No time to read? No time to go to museums? No time to meet at the bar for a drink? Any of them is sensible.
No one can do everything; we all make our priorities, and it's well within their choice not to have any one optional life thing at the top of their personal stack.
Why are you trying to argue about their preference? They didn't cast any judgement on others with different preferences.
This is nothing like “no time for exercise”.
It's more like "I have no time (preference) to fire up the wood stove so I use the microwave" and then you come in with "wow, so you roll your eyes at us wood stove users?"
Can someone with low vision tell me if this would be useful to them? It may be that specialist tools already do this better.
The real question is "what tools are they already using and how can I make sure those tools are providing higher quality output?". There are standards in browsers for these kinds of things (ways to hint navigation via accessibility tools for example).
Yes, that was my second thought. But I'd rather ask someone than rely on my assumptions.
My example: I was never a Wheel of Time fan, but the new audio editions done by Rosamund Pike are quite the performance, and make me like the story. She brings all the characters to life in a way that's different than just reading. It's a true performance.
Just imagine what this would do for writers. They can get instant feedback and adjust their book for the audiobook.
Anyway, even if in theory it might, in practice things may end up even worse than just doing it with a monotone voice.
Computer chess took a long time to get better than the best players in the world, but it was better than most chess players for many years before that. We're seeing that a lot with these generative models.
He also narrates another scifi book series and honestly I dislike this a lot.
He became the voice of one particular character for me.
I would love variety.
Might be because our brains try to 'feel' the speaker, the emotion, the pauses, the invisible smile, etc.
No doubt models will improve and will be harder to identify as AI generated, but for now, as with diffusion images, I still notice it and react by just moving on..
Take a moment here for a second though and think about it. Even if these voices got to be really good, indistinguishable almost... would I want to listen to it even then? If it was an NPC's generated voice and generated dialogue in a game to help enrich the world building, maybe in that context. On YouTube or with newscasters? Probably not. Audio books? Think I would still rather have it be a real person, because it's like they're reading a story to me and it feels better if it's coming from someone. There's also the unknown factor, where if it's ML generated it's so sterile that the unknowns are kind of gone.
Think about it like this, in the movie industry we had practical effects that were charming in a way. You could think about the physical things that had to occur to make that happen. Movie magic. Now, everything is so CG it's like the magic is gone. Even though you know people put serious hard work into it, there's a kind of inauthenticity and just lack of relevance to the real world that takes something away from it.
It's like a real magician has interesting tricks, while an artificial magician is most likely just a liar.
Still, I grant that it makes some cool things possible and there is potential if things are done right. Some positive mixture of real humans and machine generated stuff so it isn't devoid of anything connected to real life effort.
Future generations will never know a world where you don't watch a 2 hour AI generated orientation video about the wonders of working for Generic Corp when you start a new job.
I mean, I do that because it's correlated with the content being garbage. If I'm intentionally using it on content I want to consume I expect it to be different, though I haven't gotten around to trying it properly yet so I guess we'll see. (OTOH I already listen to ebooks via pre-AI TTS, so I'm optimistic)
> I never said she stole my money
It can have 7 different meanings based on which word you stress.
The new AI voices sound very natural at a shallow level, but overall pronounce things in odd ways. Not quite wrong, but subtly unnatural which introduces some cognitive load.
Old TTS systems with their monotonic voices are less confusing, but sound very robotic.
Doesn't mean the quality is bad. In fact I think Kokoro's quality is amazing.
But it is not the right tool for narration; the kind of training data they use makes the output sound too flat, if that makes sense.
Edit: I'll wait to see if any recommendations get made here, if not I might give this one a go: https://github.com/coqui-ai/TTS
I also found DEMUCS + Whisper + pydub to be a super helpful combo for creating quality datasets.
Though according to the TTS leaderboard, Fish Speech https://github.com/fishaudio/fish-speech and Kokoro are higher.
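In case it's useful, here's a minimal sketch of that DEMUCS + Whisper + pydub combo (assuming the demucs CLI, openai-whisper and pydub are installed; the output path and the 1-15 second clip filter are my own assumptions):

    import os
    import subprocess
    import whisper
    from pydub import AudioSegment

    SRC = "narration.mp3"

    # 1. Demucs: keep only the vocal stem (drops music/noise beds).
    subprocess.run(["demucs", "--two-stems=vocals", SRC], check=True)
    vocals = "separated/htdemucs/narration/vocals.wav"  # path depends on demucs model/version

    # 2. Whisper: transcribe with per-segment timestamps.
    model = whisper.load_model("base")
    result = model.transcribe(vocals)

    # 3. pydub: slice into (clip, text) pairs for a TTS training set.
    os.makedirs("clips", exist_ok=True)
    audio = AudioSegment.from_file(vocals)
    for i, seg in enumerate(result["segments"]):
        clip = audio[int(seg["start"] * 1000):int(seg["end"] * 1000)]
        if 1000 < len(clip) < 15000:  # keep clips between 1 and 15 seconds
            clip.export(f"clips/{i:05d}.wav", format="wav")
            with open(f"clips/{i:05d}.txt", "w") as f:
                f.write(seg["text"].strip())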
https://jdsemrau.substack.com/p/teaching-your-agent-to-speak...
There's some contemporary discussion of what happened here: https://tidbits.com/2009/03/02/why-the-kindle-2-should-speak...
I think there is still integration with Audible, though. If you buy a book on the Kindle and on Audible, the position will sync, and you can switch between listening and reading without losing your place in the book.
I tried it while on a treadmill so it allowed me to follow the book with more focus without sacrificing much else.
It wasn't a good experience but it was nice to be able to keep 'reading' a book while I was exercising.
It worked for me for over a decade, until I broke the device. I don't know if I never updated the firmware or if the fact I used Calibre to convert books bypassed the feature gate.
It's more of an open problem how to create those epubs. I have some code that can do it using Elevenlabs audio, but I imagine it's way harder to have something similar for a human narrator... who's going to do the sync? Maybe we need a sync AI.
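A crude version of that sync may not need a new AI: if you already have the narration audio, a recognizer with timestamps gets you segment-level anchors, and the remaining (hard) part is fuzzy-matching those segments back to the book text. A sketch assuming openai-whisper:

    import json
    import whisper

    # Transcribe the narration (human or TTS) and keep segment timestamps.
    model = whisper.load_model("base")
    result = model.transcribe("chapter01.mp3")

    # Emit a simple text<->audio map; an EPUB media-overlay (SMIL) file
    # could be generated from the same (start, end, text) triples.
    sync = [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
    ]
    with open("chapter01.sync.json", "w") as f:
        json.dump(sync, f, indent=2)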
For Android:
- Moon+ Reader Pro - some paid high-quality TTS voices (like Acapela)
For iOS:
- Kybook reader and internal iOS voices (no external TTS voices for the walled garden)
This works well enough to listen to a book while you walk and when you get back home read on the WC from the place you stopped.
Additionally, if you buy a tablet or an Android ebook reader, you install the app there and you can continue on your bigger/better device seamlessly.
Whisper-sync for the masses! Ahoy...
What surprised me in a good way was that my Kindle app was aware of this and asked if I wanted to download the Audible version of the current book I am reading.
Been listening on the way to work and then reading on the way back. Enjoying it so far.
Not quite seamless but it works. It has a cursor that follows the words as they're spoken, which allows you to read and hear ("immersive reading"), which I find to be extremely helpful for maintaining focus.
- take an ebook in any language
- AI translates it to German
- AI speaks it using the voice of their fav narrator
- a UI showing the text as it is being read
Now they can read Asimov, Kurlansky, Bryson, regardless of whether a translation or audio version exists. :)
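A minimal sketch of that pipeline, paragraph by paragraph so a UI can show the text being read; translate() and tts() are hypothetical stand-ins for whatever MT and voice models end up being used:

    # Hypothetical pipeline: ebook paragraphs -> German text -> audio clips,
    # yielding both so a UI can highlight the paragraph currently playing.
    def narrate_in_german(paragraphs, translate, tts, voice="fav_narrator"):
        for idx, para in enumerate(paragraphs):
            german = translate(para, target_lang="de")   # hypothetical MT call
            audio = tts(german, voice=voice)             # hypothetical TTS call
            yield {"index": idx, "text": german, "audio": audio}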
https://www.theverge.com/2024/5/20/24161253/scarlett-johanss...
But some people could have mistaken it due to some regional accent similarities, though that would be akin to interpreting any light southern drawl with a similar timbre as being SJ.
Sounds like SJ has a better legal team than OAI does.
I once heard an American friend with so-so Japanese ability ask a Japanese woman who had recently had a heart operation how her kokoro was doing, and she looked surprised and taken aback.
Side note: After I started reading HN in 2019, I was struck by how many tech products mentioned here have Japanese names. I compiled a list for a few years and eventually posted it:
I'm not sure if that is related here.
Years ago, when I was dating someone who spoke Russian as one of her native languages, we had to do a funny compromise when watching films together with her parents: they didn't speak a word of English, so we'd use the Russian dub with English subtitles.
I noticed that the Russian dub was just one man reading a translation in a flat voice over what was happening on the screen, no attempts at voice acting or matching the emotions. Usually the dub would have a split second delay to the actual lines, so you'd still hear the original voices for a moment (and also a little bit in the background).
At first I found it very jarring, but they explained that this flatness was a feature. You'll quickly learn to "filter out" the voice while still hearing the translation, and the faint presence of the original voices was enough to bring the emotional flavor back. The lack of voice acting helped with the filtering.
This turned out to apply to me as well, even though I don't speak Russian! My brain subconsciously would filter out the dub, and extract most of the original performance through the subtitles and faint presence of the original voices. Obviously the original version would have been a better experience for me, but it was still very enjoyable.
Of course a generated audiobook is not a dub, as there is no "original voice" to extract an emotional performance from. But some listeners might still be able to do something similar. The lack of understanding in the generated voice and its predictable monotony might allow them to filter out everything but the literal text, and then fill it in with their own emotional interpretations. Still not as great as having a proper storyteller who does understand the text and knows how to deliver dramatic lines, but perhaps not as bad as expected either.
When the foreign movies started to filter into the Soviet Union's illegal movie theatres, you would get 3 or 4 movies playing at once in one room. There would be a TV in each corner of the room and 4 or 5 rows of plastic chairs in front of it in an arch.
ALL of the movies were being revoiced by the same person. So, if you were sitting in the back of the 5th row, you were potentially getting the sound from an action movie, a comedy, a horror movie and a romance at the same time. In the same voice.
You learned to filter really well. So, if that's what they were trained on, watching a single movie must have been very relaxing.
To add on a slight tangent: many books/audiobooks just don't exist in other languages at all. So even getting a monotone version is a lot better than getting nothing.
I think this is where these models really shine. Cheaply creating cross language media and unlocking the knowledge/media to underprivileged parts of the world.
I figured that their opinion probably wasn't universal, hahaha.
And yes, it's at the very least a win for accessibility
I dislike german and russian style dubs as well, I'd rather learn a bit of the original language.
So, it was not just the voice, but the quality control pipeline that was missing as well.
Maybe it mostly works for old plain text books, but if nobody is checking.....
But this one works pretty quickly, is easy to install, and has some passable voices. Finally I can start listening to those books that have no audio version.
I'm a slow reader, so don't read many books. If a book doesn't have an audiobook version, chances are I won't read it.
PS, I have used elevenlabs in the past for some small TTS projects, but for a full book, it's price prohibitive for personal use. (elevenlabs has some amazing voices)
Thank you to the dev/s who worked on this!
How the hell was it trained on that little data?
I'm checking what the actual quality is (not a cherry-picked example), but:
Started at: 13:20:04
Total characters: 264,081
Total words: 41548
Reading chapter 1 (197,687 characters)...
That's 1h30 ago; there's no progress notification of any kind, so I'm hoping it will finish sometime. It's using 100% of all available CPUs so it's quite a bother. (This is "A Tale of a Tub" by Swift; it's about half of a typical novel's length.)
It did finish and the result is basically as good as the provided example, so I'd say quite good! I'll plan to process a book before going to bed next time!
Chapter 1 read in 6033.30 seconds (33 characters per second)
An example is The Hobbit and The Lord of the Rings: the narrator, Rob Inglis, gives an amazing voice performance, adding depth to environments and characters. And of course the songs!
Depending on what that means, it might be more accurate to say it was trained on 100 hours of audio and with the aid of another, pre-trained model. The reader who thinks “only 100 hours?!” will know to look at the pretraining requirements of the other model, too.
The saddest thing is that people will still continue to participate in consuming these AI produced “goods”.
https://k2-fsa-web-assembly-tts-sherpa-onnx-en.static.hf.spa...
I know it should work for Firefox on an article in reader mode.
Or on macOS you can select text and have it read out loud.
However, an easier way to read articles aloud is with the Read Aloud extension: https://github.com/ken107/read-aloud.
Guess it was just a matter of time till someone figured out how to use "AI" to resume encouraging illiteracy.
Guess it was just a matter of time till someone figured out how to use "cars" to resume encouraging being unable to do a basic farrier job.
Skills atrophy for a reason. It's fine to let them. You may as well be lamenting the lost art of long division.
It's not the case that it's worse.
I am curious, is there an equivalent lightweight model for speech-to-text that can run in real time on a MacBook? I'm just playing around with AI models and was looking into this (a fully locally running app that lets you talk to your computer).
Some audiobooks have this and I think it really makes the experience much more engaging.
(Also maybe some background sound effects but not sure about that, some books also have this and it's quite nice too)
That should actually be possible to do already with existing tech. I haven't seen whether you can instruct Kokoro to read in a certain way; does anyone know if this is possible?
https://emosphere-tts.github.io/
We are getting there
https://www.microsoft.com/en-us/research/project/emoctrl-tts...
The odd thing is that while they are releasing these great-sounding models, they are not documenting the training process. What we want to know is what magic, if any, allowed them to create such wonderful voices...
It's one step above "normal" text-to-speech solutions, but not much above it. The epub has "Chapter 1" as the title on the page, and a lot of whitespace, and then "This was...." (actual text). The software somehow managed to ignore all the whitespace and read "chapter 1 this was.." as a single sentence, no pauses, no nothing.
Blind? A great tool. Will it replace actual audiobooks? Well.. not yet at least.
... audiblez book.epub -l en-gb -v af_sky.
It does not; instead it installs a Python package with a CLI interface. To run it, you have to invoke the module with python, like this:
python3 -m audiblez book.epub -l en-gb -v af_sky.
If you haven't observed this in many other markets, you live an unusual (or unobservant) life.
Here is a detailed comparison chart I have made that tracks over 100 features across most popular apps: https://speechcentral.net/speech-central-vs-voice-dream-read...
$80/yr.
Yaaaaaay.
Like you, though, I had that reaction to the subscription model for macOS and therefore decided not to "buy" it when it came out.
It's $80/yr for the iOS app.