On the other hand, some of my favorite audiobooks stood out because the narrator was interpreting the text really well, for example by changing the pacing during chaotic moments. Or those audiobooks with multiple narrators and different voices for each character. Not to mention that sometimes the only cue you get for who's speaking during dialogue is how the voice actor changes their tone. I have mixed feelings about using this and losing some of that quality.
I would totally use this over amateur ebooks or public domain audiobooks like the ones on Project Gutenberg. As cool as it is/was for someone to contribute to free books... as a listener it was always jarring to switch to a new chapter and hear a completely different voice and microphone quality for no reason.
This (and everything else with AI) isn't saying "you don't need good actors any more". It's saying "if you don't have an audiobook, you can make a mediocre one automatically".
AI (text, images, videos, whatever) doesn't replace the top end, it replaces the entire bottom-to-middle end.
An embalming tech for our dying civilization.
Printing presses produce superior products.
A mediocre audiobook is certainly better than no audiobook at all, but it is an inferior product to a well produced audiobook.
That seems like a highly dubious statement. Many hand-illuminated manuscripts are masterpieces of art. The advantage of the printing press was chiefly economic, making the cost of a copy dramatically lower, not an increase in quality (especially by the aesthetic standards of the time).
I love audiobooks but at this point, most of what I want to listen to is stuff that would not sell enough to bother having someone read.
There are also many voice actors whose reading style I simply don't like.
A future where I can pick a voice that I like for any PDF is a huge upgrade.
I think a problem people have, if they're on the young side, is that maybe they didn't expect the future to change like this.
No one I knew went on the internet when I graduated high school. Change like this is all par for the course. The only advice I got in high school from a guidance counselor was that I had a nice voice for radio. Books on tape was not exactly a career option at the time. The culture will survive the death of a career path that didn't even really exist when I was a senior in high school.
Even current SOTA models would almost certainly be able to handle multiple speakers and pick up on the intended tone and intonation.
Don't make the mistake of thinking what we have today is what we will still be working with in 5 or 10 years.
There will be curation and specialization. Previously ignored niches now will be economically profitable. It will be a Renaissance of creativity, and millions of jobs will be created.
There's a lot more to be said for the value of audio books, but the accessibility gains of proliferated auto-generated audiobooks outweigh the downside of losing a small number of expertly produced audio books.
For context, I listen to audio books a lot, and for years I have listened to traditional TTS readings of books too. Better voice generation for books without audiobooks is a great win for society.
Death of a civilization doesn't mean disappearance of mankind or even overall regression in the long term.
(To be clear, nothing is solely and exclusively caused by any one thing. Causality is a very fuzzy concept. But sans printing press, those wars certainly wouldn’t have happened when/where/how they did, if they ever happened at all).
https://ehne.fr/en/encyclopedia/themes/european-humanism/eur...
Without trains, the logistics of canned food isn't much better than the logistics of any bread-based food you give to your soldiers. It doesn't solve the weight problem, which was the key problem in preindustrial army logistics.
AI is big and significant, but we'll be ok. There is also no such "one" thing as "our civilisation". We're extremely vast and complex, deeply interconnected networks of ever-changing relationships.
AI does indeed represent the commoditisation of things we used to really value like "craftsmanship in book narration" and "intelligence". But we've had commoditisations of similar media in the past.
Paper used to be extremely expensive, but as time went on, it became more and more commoditised.
Memory used to be extremely expensive (2000-3000 years ago, we needed to encode memory in _dance_, _stories_ and _plays_. Holy shit). Now you can purchase enough memory to store a billion books for maybe two hours of labor.
Most of these things don't really matter. What is happening is that the media landscape is significantly shifting, and that is a tale as old as history.
I do think the intellectual class will be affected the most. People who understand this shift stand to benefit enormously, while those who don't _might_ end up in a super awful super low class.
And yet, all of that doesn't really matter if you just move to, I dunno, Paramaribo or whatever. The people there are pragmatic and friendly. They don't care about AI too much. Or maybe New Zealand, or Iceland, or Peru, or Nepal or I don't know.
The world isn't ending. Civilisation isn't being destroyed at our core.
The media landscape is changing, classes are shifting, power-relationships are changing. I suggest you think deeply about where you want to live, what you stand for and what is most important to you in life.
I don't need money or tech to be happy. I am fine with just my cats, my closest friends and family and healthy food.
If it happens to be the case that I need to leave tech or that extremely high-end narrated audiobooks cease to exist? Then all I have to say is "oh no, anyway".
We'll be fine. One way or another.
Just different.
This stance always reminds me of "Profession", a 1957 novella by Isaac Asimov that depicts pretty much a future where there are only top performers and the ignorant crowd.
The "top-enders" are the privileged who need to have some of their gains for their intelligence redistributed to others. The alternative is "survival of the smartest", which is de-facto what we have today and what Young was trying to warn us about.
IMGO (gut opinion), generative AI is a consumption aid, like a strong antacid. It lets us be done with $content quicker, for content = {book, art, noisy_email, coding_task}. There are obvious preconceptions forming among us all from the "generative" nomenclature, but a lot of the usages that survive are reductive in genuinely useful ways.
Even on the non-fiction side, the narration for Gleick's The Information adds something.
While I want this tool for all the stuff with no narration, NYT/New Yorker/etc replacing human narrators with AI ones has been so shitty. The human narrators sound good, not just average. They add something. The AI narrators are simply bad.
New authors and self-publishers can't afford tens of thousands of dollars to get an audiobook recorded professionally... This can limit their distribution.
Authors might even choose not to make such a version (or lack the confidence to record it themselves), so AI capable of making a decently passable version would be nice -- something more than reading text blandly. AI could in theory attempt to track the scene and adjust.
I wonder if a standardized markup exists to do so.
With LLMs proving to be very good at generating code, it may be reasonable to assume they can get good at generating SSML as well.
Not sure if there is a more direct way to channel the interpretation of the tone/context/emotion etc from prose into generated voice qualities.
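If it helps to picture it, here's a rough sketch of that idea. The llm.complete() call is a hypothetical stand-in for whatever model is used; the SSML tags themselves are standard, though how faithfully a given TTS engine honors them varies.

    # Rough sketch: ask an LLM to rewrite prose as SSML, then hand the result
    # to whichever SSML-aware TTS engine you use. llm.complete() is a
    # hypothetical placeholder; the tags are standard SSML.
    PROMPT = (
        "Rewrite the following passage as SSML. Use <prosody rate='...'> for "
        "pacing, <emphasis> for stressed words, and <break time='...'/> "
        "between beats of dialogue. Output only the SSML.\n\n"
    )

    def passage_to_ssml(llm, passage):
        ssml = llm.complete(PROMPT + passage)   # hypothetical LLM call
        if not ssml.strip().startswith("<speak>"):
            raise ValueError("model did not return SSML")
        return ssml

    # The kind of output you'd hope for:
    # <speak>
    #   <prosody rate="fast">The door burst open.</prosody>
    #   <break time="400ms"/>
    #   <emphasis level="strong">"Run!"</emphasis> she shouted.
    # </speak>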
If we train some models on ebooks along with their professionally produced human-narrated audiobooks, with enough variety and volume of training data, the models might capture the essence of that human-interpretation of written text? Just maybe?
Amazon with its huge collection of Audible + Kindle library -- if it can do this without violating any rights -- has a huge corpus for this. They already have "whispersync" which is a feature that syncs text in a kindle ebook with words in corresponding audible audiobook.
Probably the results with a model trained for this plus human audit could lead to very good results.
TortoiseTTS has a few examples under prompt engineering on their demo site: https://nonint.com/static/tortoise_v2_examples.html
But the difference compared to good audiobooks is that you have:
* different voices for the narrator and each character
* different emotions and/or speed in certain situations
I guess you could use an LLM to "understand" and annotate an existing book if there's a markup, and then use TTS to create an audiobook from it, and so automate most of the process.
"Annotate the following text with speakers and emotions so that it can be turned into an audiobook via TTS", followed by a short text from "The Hobbit" (The "Good morning scene"). The result is very good.
Oh, and it's also a boon for those who can't afford to buy audiobooks.
They are also different activities: with audio it's easier to listen to more, but retention is usually lower. Not casting any elitist "you need to read" bullshit by the way, but I find it odd to define it in terms of lack of time, and I really like both mediums.
there are other factors as well. i love reading so much that i tend to forget time around me. as a result reading would cause me to neglect other duties. i can't allow that, and therefore i am forced to avoid reading. i also don't like long form reading on electronic devices, and as a frequent traveler, printed books are simply not practical and often not even accessible.
i agree with the retention issue, but i found that a much larger factor for retention is how well i can follow the story. a good story that is easy to get into is also easier to retain. and finally, reading fiction is for entertainment. i don't have to retain it.
There are a few categories where it makes sense to roll your eyes, like if they say they have no time to shower or have never been to one of their kid's baseball games.
But for things that aren't basic human expectations, I think you'd have to be a real jerk to roll your eyes at someone not having time. No time to cook multi-pot dishes? No time to exercise? No time to read? No time to go to museums? No time to meet at the bar for a drink? Any of them is sensible.
No one can do everything; we all make our priorities, and it's well within their choice not to have any one optional life thing at the top of their personal stack.
Why are you trying to argue about their preference? They didn't cast any judgement on others with different preferences.
This is nothing like “no time for exercise”.
It's more like "I have no time (preference) to fire up the wood stove so I use the microwave" and then you come in with "wow, so you roll your eyes at us wood stove users?"
Can someone with low vision tell me if this would be useful to them? It may be that specialist tools already do this better.
The real question is "what tools are they already using and how can I make sure those tools are providing higher quality output?". There are standards in browsers for these kinds of things (ways to hint navigation via accessibility tools for example).
Yes, that was my second thought. But I'd rather ask someone than rely on my assumptions.
My example: I was never a Wheel of Time fan, but the new audio editions done by Rosamund Pike are quite the performance, and make me like the story. She brings all the characters to life in a way that's different than just reading. It's a true performance.
Just imagine what this would do for writers. They can get instant feedback and adjust their book for the audiobook.
Anyway, even if in theory it might, in practice things may end up even worse than just doing it with a monotone voice.
Computer chess took a long time to get better than the best players in the world, but it was better than most chess players for many years before that. We're seeing that a lot with these generative models.
He also narrates another scifi book series and honestly I dislike this a lot.
He became the voice of one particular character for me.
I would love variety.
Might be because our brains try to 'feel' the speaker, the emotion, the pauses, the invisible smile, etc.
No doubt models will improve and will be harder to identify as AI generated, but for now, as with diffusion images, I still notice it and react by just moving on..
Take a moment here for a second though and think about it. Even if these voices got to be really good, indistinguishable almost... would I want to listen to it even then? If it was an NPC's generated voice and generated dialogue in a game to help enrich the world building, maybe in that context. On YouTube or with newscasters? Probably not. Audio books? Think I would still rather have it be a real person, because it's like they're reading a story to me and it feels better if it's coming from someone. There's also the unknown factor, where if it's ML generated it's so sterile that the unknowns are kind of gone.
Think about it like this, in the movie industry we had practical effects that were charming in a way. You could think about the physical things that had to occur to make that happen. Movie magic. Now, everything is so CG it's like the magic is gone. Even though you know people put serious hard work into it, there's a kind of inauthenticity and just lack of relevance to the real world that takes something away from it.
It's like a real magician has interesting tricks, while an artificial magician is most likely just a liar.
Still, I grant that it makes some cool things possible and there is potential if things are done right. Some positive mixture of real humans and machine generated stuff so it isn't devoid of anything connected to real life effort.
Future generations will never know a world where you don't watch a 2 hour AI generated orientation video about the wonders of working for Generic Corp when you start a new job.
I mean, I do that because it's correlated with the content being garbage. If I'm intentionally using it on content I want to consume I expect it to be different, though I haven't gotten around to trying it properly yet so I guess we'll see. (OTOH I already listen to ebooks via pre-AI TTS, so I'm optimistic)
> I never said she stole my money
It can have 7 different meanings based on which word you stress.
The new AI voices sound very natural at a shallow level, but overall pronounce things in odd ways. Not quite wrong, but subtly unnatural which introduces some cognitive load.
Old TTS systems with their monotonic voices are less confusing, but sound very robotic.
Doesn't mean the quality is bad. In fact I think Kokoro's quality is amazing.
But it is not the right tool for narration; the kind of training data they use makes the output sound too flat, if that makes sense.
Edit: I'll wait to see if any recommendations get made here, if not I might give this one a go: https://github.com/coqui-ai/TTS
I also found DEMUCS + Whisper + pydub to be a super helpful combo for creating quality datasets.
Though according to the TTS leaderboard, Fish Speech https://github.com/fishaudio/fish-speech and Kokoro are higher.
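In case it's useful, here's a minimal sketch of that DEMUCS + Whisper + pydub combo (assuming the demucs CLI, openai-whisper and pydub are installed; the output path and the 1-15 second clip filter are my own assumptions):

    import os
    import subprocess
    import whisper
    from pydub import AudioSegment

    SRC = "narration.mp3"

    # 1. Demucs: keep only the vocal stem (drops music/noise beds).
    subprocess.run(["demucs", "--two-stems=vocals", SRC], check=True)
    vocals = "separated/htdemucs/narration/vocals.wav"  # path depends on demucs model/version

    # 2. Whisper: transcribe with per-segment timestamps.
    model = whisper.load_model("base")
    result = model.transcribe(vocals)

    # 3. pydub: slice into (clip, text) pairs for a TTS training set.
    os.makedirs("clips", exist_ok=True)
    audio = AudioSegment.from_file(vocals)
    for i, seg in enumerate(result["segments"]):
        clip = audio[int(seg["start"] * 1000):int(seg["end"] * 1000)]
        if 1000 < len(clip) < 15000:  # keep clips between 1 and 15 seconds
            clip.export(f"clips/{i:05d}.wav", format="wav")
            with open(f"clips/{i:05d}.txt", "w") as f:
                f.write(seg["text"].strip())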
https://jdsemrau.substack.com/p/teaching-your-agent-to-speak...
There's some contemporary discussion of what happened here: https://tidbits.com/2009/03/02/why-the-kindle-2-should-speak...
I think there is still integration with Audible, though. If you buy a book on the Kindle and on Audible, the position will sync, and you can switch between listening and reading without losing your place in the book.
I tried it while on a treadmill so it allowed me to follow the book with more focus without sacrificing much else.
It wasn't a good experience but it was nice to be able to keep 'reading' a book while I was exercising.
It worked for me for over a decade, until I broke the device. I don't know if I never updated the firmware or if the fact I used Calibre to convert books bypassed the feature gate.
It's more of an open problem how to create those epubs. I have some code that can do it using Elevenlabs audio, but I imagine it's way harder to have something similar for a human narrator... who's going to do the sync? Maybe we need a sync AI.
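A crude version of that sync may not need a new AI: if you already have the narration audio, a recognizer with timestamps gets you segment-level anchors, and the remaining (hard) part is fuzzy-matching those segments back to the book text. A sketch assuming openai-whisper:

    import json
    import whisper

    # Transcribe the narration (human or TTS) and keep segment timestamps.
    model = whisper.load_model("base")
    result = model.transcribe("chapter01.mp3")

    # Emit a simple text<->audio map; an EPUB media-overlay (SMIL) file
    # could be generated from the same (start, end, text) triples.
    sync = [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
    ]
    with open("chapter01.sync.json", "w") as f:
        json.dump(sync, f, indent=2)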
For Android:
- Moon+ Reader Pro - some paid high-quality TTS voices (like Acapela)
For iOS:
- Kybook reader and internal iOS voices (no external TTS voices for the walled garden)
This works well enough to listen to a book while you walk and when you get back home read on the WC from the place you stopped.
Additionally, if you buy a tablet or an Android ebook reader, you install the app there and you can continue on your bigger/better device seamlessly.
Whisper-sync for the masses! Ahoy...
What surprised me in a good way was that my Kindle app was aware of this and asked if I wanted to download the Audible version of the current book I am reading.
Been listening on the way to work and then reading on the way back. Enjoying it so far.
Not quite seamless but it works. It has a cursor that follows the words as they're spoken, which allows you to read and hear ("immersive reading"), which I find to be extremely helpful for maintaining focus.
- take an ebook in any language
- AI translates it to German
- AI speaks it using the voice of their fav narrator
- a UI showing the text as it is being read
Now they can read Asimov, Kurlansky, Bryson, regardless of whether a translation or audio version exists. :)
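A minimal sketch of that pipeline, paragraph by paragraph so a UI can show the text being read; translate() and tts() are hypothetical stand-ins for whatever MT and voice models end up being used:

    # Hypothetical pipeline: ebook paragraphs -> German text -> audio clips,
    # yielding both so a UI can highlight the paragraph currently playing.
    def narrate_in_german(paragraphs, translate, tts, voice="fav_narrator"):
        for idx, para in enumerate(paragraphs):
            german = translate(para, target_lang="de")   # hypothetical MT call
            audio = tts(german, voice=voice)             # hypothetical TTS call
            yield {"index": idx, "text": german, "audio": audio}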
https://www.theverge.com/2024/5/20/24161253/scarlett-johanss...
But some people could have mistaken it due to some regional accent similarities, though that would be akin to interpreting any light southern drawl with a similar timbre as being SJ.
Sounds like SJ has a better legal team than OAI does.
I once heard an American friend with so-so Japanese ability ask a Japanese woman who had recently had a heart operation how her kokoro was doing, and she looked surprised and taken aback.
Side note: After I started reading HN in 2019, I was struck by how many tech products mentioned here have Japanese names. I compiled a list for a few years and eventually posted it:
I'm not sure if that is related here.
Years ago, when I was dating someone who spoke Russian as one of her native languages, we had to do a funny compromise when watching films together with her parents: they didn't speak a word of English, so we'd use the Russian dub with English subtitles.
I noticed that the Russian dub was just one man reading a translation in a flat voice over what was happening on the screen, no attempts at voice acting or matching the emotions. Usually the dub would have a split second delay to the actual lines, so you'd still hear the original voices for a moment (and also a little bit in the background).
At first I found it very jarring, but they explained that this flatness was a feature. You'll quickly learn to "filter out" the voice while still hearing the translation, and the faint presence of the original voices was enough to bring the emotional flavor back. The lack of voice acting helped with the filtering.
This turned out to apply to me as well, even though I don't speak Russian! My brain subconsciously would filter out the dub, and extract most of the original performance through the subtitles and faint presence of the original voices. Obviously the original version would have been a better experience for me, but it was still very enjoyable.
Of course a generated audiobook is not a dub, as there is no "original voice" to extract an emotional performance from. But some listeners might still be able to do something similar. The lack of understanding in the generated voice and its predictable monotony might allow them to filter out everything but the literal text, and then fill it in with their own emotional interpretations. Still not as great as having a proper storyteller who does understand the text and knows how to deliver dramatic lines, but perhaps not as bad as expected either.
When the foreign movies started to filter into the Soviet Union's illegal movie theatres, you would get 3 or 4 movies playing at once in one room. There would be a TV in each corner of the room and 4 or 5 rows of plastic chairs in front of it in an arch.
ALL of the movies were being revoiced by the same person. So, if you were sitting in the back of the 5th row, you were potentially getting the sound from an action movie, a comedy, a horror movie and a romance at the same time. In the same voice.
You learned to filter really well. So, if that's what they were trained on, watching a single movie must have been very relaxing.
To add on a slight tangent: many books/audiobooks just don't exist in other languages at all. So even getting a monotone version is a lot better than getting nothing.
I think this is where these models really shine. Cheaply creating cross language media and unlocking the knowledge/media to underprivileged parts of the world.
I figured that their opinion probably wasn't universal, hahaha.
And yes, it's at the very least a win for accessibility
I dislike german and russian style dubs as well, I'd rather learn a bit of the original language.
So, it was not just the voice, but the quality control pipeline that was missing as well.
Maybe it mostly works for old plain text books, but if nobody is checking.....
But this one works pretty quickly, is easy to install, and has some passable voices. Finally I can start listening to those books that have no audio version.
I'm a slow reader, so don't read many books. If a book doesn't have an audiobook version, chances are I won't read it.
PS, I have used elevenlabs in the past for some small TTS projects, but for a full book, it's price prohibitive for personal use. (elevenlabs has some amazing voices)
Thank you to the dev/s who worked on this!
How the hell was it trained on that little data?
I'm checking what the actual quality is (not a cherry-picked example), but:
Started at: 13:20:04
Total characters: 264,081
Total words: 41548
Reading chapter 1 (197,687 characters)...
That's 1h30 ago; there's no progress notification of any kind, so I'm hoping it will finish sometime. It's using 100% of all available CPUs so it's quite a bother. (This is "A Tale of a Tub" by Swift; it's about half of a typical novel's length.)
It did finish and the result is basically as good as the provided example, so I'd say quite good! I'll plan to process a book before going to bed next time!
Chapter 1 read in 6033.30 seconds (33 characters per second)
An example is The Hobbit and The Lord of the Rings: the narrator, Rob Inglis, gives an amazing voice performance, adding depth to environments and characters. And of course the songs!
Depending on what that means, it might be more accurate to say it was trained on 100 hours of audio and with the aid of another, pre-trained model. The reader who thinks “only 100 hours?!” will know to look at the pretraining requirements of the other model, too.
The saddest thing is that people will still continue to participate in consuming these AI produced “goods”.
https://k2-fsa-web-assembly-tts-sherpa-onnx-en.static.hf.spa...
I know it should work for Firefox on an article in reader mode.
Or on macOS you can select text and have it read out loud.
However, an easier way to read articles aloud is with the Read Aloud extension: https://github.com/ken107/read-aloud.
Guess it was just a matter of time till someone figured out how to use "AI" to resume encouraging illiteracy.
Guess it was just a matter of time till someone figured out how to use "cars" to resume encouraging being unable to do a basic farrier job.
Skills atrophy for a reason. It's fine to let them. You may as well be lamenting the lost art of long division.
It's not the case that it's worse.
I am curious, is there an equivalent lightweight model for speech-to-text that can run in real time on a MacBook? I'm just playing around with AI models and was looking into this (a fully locally running app that lets you talk to your computer).
Some audiobooks have this and I think it really makes the experience much more engaging.
(Also maybe some background sound effects but not sure about that, some books also have this and it's quite nice too)
That should actually be possible to do already with existing tech. I haven't seen whether you can instruct Kokoro to read in a certain way; does anyone know if this is possible?
https://emosphere-tts.github.io/
We are getting there
https://www.microsoft.com/en-us/research/project/emoctrl-tts...
The odd thing is that while they are releasing these great-sounding models, they are not documenting the training process. What we want to know is what magic, if any, allowed them to create such wonderful voices...
It's one step above "normal" text-to-speech solutions, but not much above it. The epub has "Chapter 1" as the title on the page, and a lot of whitespace, and then "This was...." (actual text). The software somehow managed to ignore all the whitespace and read "chapter 1 this was.." as a single sentence, no pauses, no nothing.
Blind? A great tool. Will it replace actual audiobooks? Well.. not yet at least.
... audiblez book.epub -l en-gb -v af_sky.
It does not; instead it installs a Python package with a CLI interface. To run it, you have to invoke the module with python, like this:
python3 -m audiblez book.epub -l en-gb -v af_sky.
If you haven't observed this in many other markets, you live an unusual (or unobservant) life.
Here is a detailed comparison chart I have made that tracks over 100 features across most popular apps: https://speechcentral.net/speech-central-vs-voice-dream-read...
$80/yr.
Yaaaaaay.
Like you, though, I had that reaction to the subscription model for macOS and therefore decided not to "buy" it when it came out.
It's $80/yr for the iOS app.