I sometimes see content on social media encouraging people to sound more native or improve their accent. But IMO it's perfectly ok to have an accent, as long as the speech meets some baseline of intelligibility. (So Victor needs to work on "long" but not "days".) I've even come across people who try to mimic a native accent but lose intelligibility, where they'd sound better with their foreign accent. (An example I've seen is a native Spanish speaker trying to imitate the American accent's intervocalic T and D, and I don't understand them. A Spanish /t/ or /d/ would be different from most English-language accents, but would be way more understandable.)
An easy way to improve intonation and fluency is to imitate a native speaker. Copying things like the intervocalic T and D is a consequence of that. It would be easier for a native Spanish speaker to say the Spanish /t/ and /d/ but intonation and fluency would be impaired.
The sounds don't "flow" as they should.
There are lots of variations in English pronunciation: Singaporean, Australian, and Scottish native speakers sound very different from one another. I don't know to what extent they benefit from adjusting their accent to match the local dialect when working in a different English-speaking country.
Also, as a non-native speaker, I wonder if it's worth practicing my accent, considering that everybody has a different accent anyway. Rather than trying to mimic a North American accent (which I'll never be able to do anyway), I'd be more interested in identifying and fixing the major issues in my pronunciation.
Indeed Victor would likely receive a personalized lesson and practice on the NG sound on the app.
It’s also perfectly fine to want to sound like a native speaker - whether it be because they are self-conscious, think it will benefit them in some way, or simply want to feel like they are speaking “correctly”.
Sorry to pick on you; it’s just amazing to me how sensitive we are to “inclusivity”, to the point where we almost discourage people from wanting to fit in.
And I've heard other such stories of American schools flagging kids for speech therapy when what they have is an accent. I feel like Americans are actually some of the worst about that.
Besides, it’s not like there isn’t a “correct” either - if you’re out in the Midwest, what’s correct is just what everyone is speaking.
It’s obvious that a kid from Ohio who speaks perfectly isn’t going to go to Scotland and speak it “correctly”.
Like, it’s such low-hanging fruit to always be that guy who points out the lowest-level, most obvious exception.
⸻
1. Thanks to my kids studying French on Duolingo and my joining them, I can no longer say that I’ve never studied it.
You guessed right -- it's /usually/ easier to understand other non-native speakers, both because of accent and because of fewer idioms. That is, unless the accent is really heavy and doesn't match your own.
Meanwhile, the things that stood out to me in the initial recording were the vowel sounds: for instance, "young" sounded almost like it rhymed with "long" before training. (That makes sense, since Mandarin similarly has a word with that sound, as can be found in the common last name Yang [2].)
Incidentally, Mandarin has words that sound like "lung" (e.g. the word for "cold" [3]), but if you replace the "l" sound at the front with a "y" sound, depending on which of two transformations you use, it turns the vowel sound into a long o [4] (near rhyme with "lone"). (There is another transformation that you can use that results in a leading "y" in pinyin, but in that specific case, the vowel turns into a long e, and the "y" is largely silent (e.g. the word for "solid" [5]).)
In the last recording, Victor is clearly rushing through the sentence, and you can tell that where he previously had a clear "s" ending for the word "days", it's now slurred into a "th" sound. Agreed that that's actually a net negative for intelligibility.
The wiktionary links below have clips of pronunciation. I will note that not all native speakers have a Standard Chinese accent [6,7], so there are assuredly some differences in pronunciation to be expected depending on exactly which region said individual hails from.
[0] https://en.wiktionary.org/wiki/%E6%B5%AA
[1] https://en.wiktionary.org/wiki/long
[2] https://en.wiktionary.org/wiki/y%C3%A1ng#Mandarin
[3] https://en.wiktionary.org/wiki/%E5%86%B7
[4] https://en.wiktionary.org/wiki/%E7%94%A8
[5] https://en.wiktionary.org/wiki/%E7%A1%AC
As a learning platform that provides instruction to our users, we do need to set some kind of direction in our pedagogy, but we 100% recognize that there isn't just 1 American English accent, and there's lots of variance.
You can measure this by mutual intelligibility with other accent groupings.
Yes they are. If we lived in a world where Australia was a world superpower we might be having this conversation upside down and with r’s where there shouldn’t be any, but we don’t. Every student wants to learn to speak with an American accent because it has the highest level of intelligibility owing to exposure via cinema, music, expatriate communities, etc.
Along similar lines, it would be useful to map a speaker's vowels in vowel-space (and likewise for consonants?) to compare native to non-native speakers.
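To make the vowel-space idea concrete, here is a minimal sketch, assuming the parselmouth bindings to Praat; the wav file names, vowel labels, and midpoint times are placeholders I made up, and none of this reflects how BoldVoice measures anything.

    # Plot a speaker's vowels in F1/F2 "vowel space" using Praat via parselmouth.
    import parselmouth
    import matplotlib.pyplot as plt

    # (vowel label, recording, rough time of the vowel's midpoint in seconds)
    samples = [
        ("i",  "heed.wav", 0.25),
        ("ae", "had.wav",  0.25),
        ("u",  "whod.wav", 0.25),
    ]

    for label, path, t in samples:
        sound = parselmouth.Sound(path)
        formants = sound.to_formant_burg()        # Burg-method formant tracking
        f1 = formants.get_value_at_time(1, t)     # F1 roughly tracks vowel height
        f2 = formants.get_value_at_time(2, t)     # F2 roughly tracks frontness/backness
        plt.scatter(f2, f1)
        plt.annotate(label, (f2, f1))

    # Conventional vowel-chart orientation: high front vowels end up top-left.
    plt.gca().invert_xaxis()
    plt.gca().invert_yaxis()
    plt.xlabel("F2 (Hz)")
    plt.ylabel("F1 (Hz)")
    plt.title("Vowel space")
    plt.show()

Overlaying a native and a non-native speaker's points on the same chart would give a crude visual of which vowels sit farthest from the target.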
I can't wait until something like this is available for Japanese.
A good accent coach would be able to do much better by identifying exactly how you're pronouncing things differently, telling you what you should be doing in your mouth to change that, and giving you targeted exercises to practice.
Presumably a model that predicts the position of various articulators at every timestamp in a recording could be useful for something similar.
...unless they had access to a native speaker and/or vocal coach? While an automated Henry Higgins is nifty, it's not something humans haven't been able to do themselves.
Japanese is sort of like this - you have to say foreign words the Japanese way very forcibly, to the point that Americans will think you're being racist if they hear you do it.
Do you have a source for this? It doesn't seem plausible to me, but I'm not an expert.
I assume that, with enough training, we could get similarly accurate guesses of a person's linguistic history from their voice data.
Obviously it would be extremely tricky for lots of people. For instance, many people think I sound English or Irish. I grew up in France to American parents who both went to Oxford and spent 15 years in England. I wouldn't be surprised, though, if a well-trained model could do much better on my accent than "you sound kinda Irish."
Edit: Tried it a few times and also got English as an accent. Pretty fun application!
I'm terrible, according to the program. My Italian is Russian or Hungarian or Swedish, my Australian is English.
New party game unlocked.
This kind of speech clustering has been possible for years - the exciting point with their model here is how it's focused on accents alone. Here's a video of mine from 2020 that demonstrated this kind of voice clustering in the Mozilla TTS repo (sadly the code got broken and dropped after a refactoring). Bokeh made it possible to click directly on points in a cluster and have them play.
https://youtu.be/KW3oO7JVa7Q?si=1w-4pU5488WxYL3l
note: take care when listening as the audio level varies a bit (sorry!)
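For anyone curious what that kind of setup can look like, here is a rough sketch under stated assumptions: it is not the original Mozilla TTS code, it presumes you already have per-clip speaker embeddings and audio URLs saved in the placeholder files named below, and it uses UMAP plus a Bokeh TapTool callback to play a clip when its point is clicked.

    # Interactive 2D view of voice embeddings; clicking a point plays its clip.
    import numpy as np
    import umap
    from bokeh.plotting import figure, show
    from bokeh.models import ColumnDataSource, CustomJS, TapTool

    embeddings = np.load("speaker_embeddings.npy")       # (n_clips, dim), placeholder
    audio_urls = open("clip_urls.txt").read().split()    # one audio URL per clip, placeholder

    coords = umap.UMAP(n_components=2).fit_transform(embeddings)
    source = ColumnDataSource(dict(x=coords[:, 0], y=coords[:, 1], url=audio_urls))

    p = figure(title="Voice clusters (click a point to play it)",
               tools="tap,pan,wheel_zoom,reset")
    p.scatter("x", "y", size=8, source=source)

    # Play the selected clip in the browser when its point is tapped.
    p.select(type=TapTool).callback = CustomJS(args=dict(source=source), code="""
        const i = source.selected.indices[0];
        if (i !== undefined) { new Audio(source.data["url"][i]).play(); }
    """)
    show(p)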
I had a forensic linguistics TA during college who was able to identify the island in southeast Asia one of the students grew up on, and where they moved to in the UK as a teenager before coming to the US (if I am remembering this story right).
From what I gather, there are a lot of clues in how we speak that most brains edit out when parsing language.
But then I read their privacy policy. They want permission to save all of my audio interactions for all eternity. It's so sad that I will never try out their (admittedly super cool) AI tech.
Yeah, I can opt out. By not using any voice-related feature in their voice training app.
This verb is used for living organisms, and AFAIK, AI is not living, nor an organism.
This kind of subtle yet nasty personification is what encourages people to believe that an AI "thinks" by itself, or by extension that "AGI" is really close, or, even worse, more easily tricks people into believing AI outputs.
"AI detects" would have been way more suitable.
That said, I found the recording of Victor's speech after practicing with the recording of his own unaccented voice to be far less intelligible than his original recording.
Looking forward to seeing the developments in this particular application.
Interesting to note that we're also developing a separate measure of intelligibility, which will give a sense of how intelligible versus how accented something is.
Just had an employee at our company start expensing BoldVoice. Being able to be understood more easily is a big deal for global remote employees.
(Note - I am a small investor in BoldVoice)
A suggestion and some surprise: I’m surprised by your assertion that there’s no clustering. I see the representation shows no clustering, and believe you that there is therefore no broad high-dimensional clustering. I also agree that the demo where Victor’s voice moves closer to Eliza’s sounds more native.
But how can it be that you can show directionality toward “native” without clustering? I would read this as a problem with my embedding, not a feature. Perhaps there are some lower-dimensional sub-axes that do encode what sort of accent someone has?
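One way the two observations can coexist (a hedged sketch, not a description of BoldVoice's model): even if the full embedding shows no discrete clusters, a single linear direction in that space can still encode "nativeness", e.g. the difference of class means or the weight vector of a linear probe. The arrays below are placeholders.

    # Project embeddings onto a "native vs. non-native" axis without needing clusters.
    import numpy as np

    native = np.load("native_embeddings.npy")        # (n1, dim), placeholder
    nonnative = np.load("nonnative_embeddings.npy")  # (n2, dim), placeholder

    axis = native.mean(axis=0) - nonnative.mean(axis=0)
    axis /= np.linalg.norm(axis)

    def nativeness(embedding):
        # Signed projection onto the axis; larger = closer to the native mean.
        return float(embedding @ axis)

A learner "moving toward Eliza" only needs this projection to increase; everything else in the embedding (voice quality, pitch, recording conditions) can stay put, which is why a 2D scatter of the whole space need not separate into clusters.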
Suggestion for the BoldVoice team: if you’d like to go viral, I suggest you dig into American idiolects — two that are hard not to talk about / opine on / retweet are AAVE and Gay male speech (not sure if there’s a more formal name for this, it’s what Wikipedia uses).
I’m in a mixed race family, and we spent a lot of time playing with ChatGPT’s AAVE abilities which have, I think sadly, been completely nerfed over the releases. Chat seems to have no sense of shame when it says speaking like one of my kids is harmful; I imagine the well intentioned OpenAI folks were sort of thinking the opposite when they cut it out. It seems to have a list of “okay” and “bad” idiolects baked in - for instance, it will give you a thick Irish accent, a Boston accent, a NY/Bronx accent, but no Asian/SE Asian accents.
I like the idea of an idiolect-manager, something that could help me move my speech more or less toward a given idiolect. Similarly, England is a rich minefield of idiolects, from Scouse to highly posh.
I’m guessing you guys are aiming at the call center market based on your demo, but there could be a lot more applications! Voice coaches in Hollywood (the good ones) charge hundreds of dollars per hour, so there’s a valuable, if small, market out there for much of this. Thanks for the demo and write-up. Very cool.
I’d consider making this feature available free with super low friction, maybe no signup required, to get some viral traction.
This is offensive :))
If so—and if you want to transfer-learn new downstream models from embeddings—then it seems to me you are onto a very effective way of doing data augmentation. It's expensive to do data augmentation on raw waveforms, since you always need to run the STFT again; but if you've pre-computed and cached embeddings and can do data augmentation there, it would be super fast.
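As a hedged illustration of what embedding-space augmentation could look like (all names below are hypothetical; nothing here is BoldVoice's pipeline): once the embeddings are cached, cheap perturbations such as Gaussian jitter or mixup-style interpolation stand in for waveform-level augmentation and its repeated STFTs.

    # Cheap augmentation applied directly to cached embeddings.
    import numpy as np

    rng = np.random.default_rng(0)

    def jitter(emb, sigma=0.01):
        # Add small isotropic noise to one cached embedding.
        return emb + rng.normal(0.0, sigma, size=emb.shape)

    def mixup(emb_a, emb_b, alpha=0.2):
        # Interpolate two cached embeddings (labels would be mixed with the same weight).
        lam = rng.beta(alpha, alpha)
        return lam * emb_a + (1.0 - lam) * emb_b

    cached = np.load("cached_embeddings.npy")   # (n, dim), placeholder file
    augmented = np.stack([jitter(e) for e in cached])

Whether this is valid depends on how smooth the embedding space is around each point; noise scales and mixup weights would need tuning against a downstream task.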
I’d be really interested to play with this tool and see what it thinks of my accent. Can it tell where I grew up? Can it tell what my parents’ native languages are (not English!)?
A free tool like this would be great marketing for this company.
Best kind of AI in the best kind of a free market.
Also, the US writing convention falls short here, like "who put the dot inside the string."
Crazy. Rational people "put the dot after the string". No spelling corrector should change that.
I’ve been using it for a few months, and I can confirm it’s working.
On the one hand, the tech is impressive, and the demo is nicely done.
On the other, I think the demo completely misses the point. There's a disconnect between what learners need to learn and what this model optimises for, and it's probably largely explained by how difficult (maybe even impossible) it is to get training datasets. That, and marketing.
I believe most learners optimise for two things: being understood [1] and not being grating to the ear [2]. Both goals hinge on acquiring the right set of phonemes and phonetic "tools", because the sets of meaningfully distinct sounds (phonemes) and "tools" rarely match between languages.
For example, most (all?) Slavic languages have way fewer meaningfully distinct vowels than English. "Meaningfully distinct" is the crucial part. The Russian word "молоко", as it's most often pronounced, has three different vowels, at least two of which would be distinct to an English speaker, but Russian speakers hear that as one-ish vowel. And I mean "hear it": it's not a conscious phenomenon! Phoneme recognition is completely subconscious, so unless specifically trained, people often don't hear the difference between sounds that are obviously different to people who speak that language natively [3].
Same goes for phonetic "tools". English speakers shorten vowels when followed by non-voiced consonants, which makes "heart" and "hard" distinguishable even when t/d are transformed into the same sound (glottal stop or a tap). This "tool" is not available in many languages, so people use it incorrectly and it sounds confusing.
So, how would ML models learn this mapping between sounds and phonemes, especially when it's non-local (like with the preceding vowel length)? It's relatively straightforward to find large sets of speech samples labelled with their speakers' backgrounds, but that's just sounds, not phonemes! There is very little signal showing which sound structures matter for humans listening to the sound and which don't. [4]
There's also a set of moral issues connected to the "target accent" approach. Teaching learners to acquire an accent that superficially sounds like whatever they chose as a "target" devalues all other accents, which are just as valid and just as English because they have the same phonetic system (phonemes + "tools"). It can also make people sound a bit cringe, which I've seen first-hand.
Ideally, learners should learn phonetic systems, not superficial accents. That's what makes speech intelligible and natural, even if it has an exotic flavour [5][6]. Systems like the one the company is building do the opposite. I guess they are easier to build and easier to sell.
[1]: On that path lies a nice surprise: being understood and understanding are two sides of the same coin, so by learning how to be understood, a language learner inevitably starts to understand better. Being able to hear the full set of phonemes is the key to both.
[2]: There's a vast, VAST difference between people not paying attention to how someone speaks and them not being able to tell that something's off when prompted.
[3]: Nice demonstration for non-Hindi speakers: https://www.youtube.com/watch?v=-I7iUUp-cX8 When isolated and spoken slowly, the consonants might sound different, but in normal speech they sound practically indistinguishable to English speakers with no prior exposure. Native speakers would hear the difference as clearly as you would in cap/cup!
[4]: Take their viral accent-recognition demo. Anecdotally, among the three non-native speakers with different backgrounds I talked to, the demo guessed their mother tongue much better than native speakers could, and its errors were different. This is a sign of the model learning to recognise the wrong things.
[5]: Ever noticed how films almost always cast native English speakers imitating non-English accents rather than people for whom that language actually is their first language? That's why: an English phonetic system with sprinkles of phonetic flavour is much more understandable.
[6]: By the way, Arnold Schwarzenegger understands this very well.
Not all languages can be neatly split into a nice set of phonemes - Danish phonology in particular seems mostly imaginary, and the "insane grammar" of Old Irish appears to result from the fact that word/morpheme boundaries can occur within the "phonemes".
That group has a vast range of accents, but it's believable that that range occupies an identifiable part of the multi-dimensional accent space, and has very little overlap with, for example, beginner ESL students from China.
Even between native speakers, I bet you could come up with some measure of centrality and measure accent strength as a distance from that. And if language families exist on a continuum, there must be some point on that continuum where you are no longer speaking English but, say, Scots or Frisian or Nigerian Creole instead. Accents close to those points are objectively stronger.
But there is a lot of freedom in how you measure centrality - if you weight by number of speakers, you might expect to get some mid-American or mid-Atlantic accent, but wind up with the dialect of semi-literate Hyderabad call centre workers.
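For what it's worth, the weighted-centroid idea is easy to write down (a toy sketch; the group names, speaker counts, and files below are invented for illustration):

    # "Accent strength" as distance from a speaker-count-weighted centre of accent groups.
    import numpy as np

    groups = {   # accent group -> (embedding file, rough number of speakers)
        "us_midwest": ("us_midwest.npy", 60_000_000),
        "rp_english": ("rp_english.npy", 10_000_000),
        "hyderabad":  ("hyderabad.npy",  30_000_000),
    }

    means, weights = [], []
    for path, count in groups.values():
        embs = np.load(path)                  # (n_speakers, dim) per group
        means.append(embs.mean(axis=0))
        weights.append(count)

    centre = np.average(np.stack(means), axis=0, weights=weights)

    def accent_strength(embedding):
        # Bigger distance from the weighted centre = "stronger" accent under this measure.
        return float(np.linalg.norm(embedding - centre))

As noted above, everything rides on the weighting: change the speaker counts (or weight by media exposure instead) and the "centre" moves.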
> Even between native speakers, I bet you could come up with some measure of centrality and measure accent strength as a distance from that
Is that what BoldVoice is actually doing? At least from what the article says, it is measuring the strength of the user's American English accent (maybe GenAm?), and there is no discussion of any user choice of a native accent to target.
No, I don't think it is doing that, I'm just taking issue with cccpurcell, who seems to believe that any definition of accent strength is chauvinistic.
Yes, that is a good definition of accent strength.
> There's no such thing as accent strength.
??! You literally just defined it.
I'm not American so I don't want to comment on that.