30 points by artavazdsm 3 hours ago | 38 comments
  • 1ilit 4 minutes ago
    On-device CPU inference is the real flex here. Optimization probably mattered as much as modeling.
  • zkhalapyan 14 minutes ago
    Yeah, this would be helpful for my Singlish-speaking friends out there!
  • achobanyan 19 minutes ago
    Local CPU inference stands out. Careful optimization likely rivaled the modeling effort.
  • MarAraqelyan 28 minutes ago
    Really cool to see accent adaptation in real time. Curious about benchmarks and how well this handles messy, real Zoom calls.
  • aris_hovsepyan 2 hours ago
    The real achievement here isn't just quality, it's doing it streaming with tight latency on CPU while preserving speaker identity. Most VC-style work looks great offline, then falls apart once you go real-time. Nice work getting this to hold up in streaming.
  • artavazdsm 3 hours ago
    Co-founder of Krisp here. 1.5B non-native English speakers in the workforce, 4x native — yet all comms infra is optimized for native accents. We spent 3 years building listener-side, on-device accent understanding. The hard parts: no parallel training data exists, the accent space is infinite, accent is entangled with voice identity, and it runs on CPU under 250ms latency. Built in Yerevan, Armenia. Beta is live and free. Happy to go deep on the ML side.
    • AlexeyBelov 3 hours ago
      What do you think about the misuse potential (by scammers for example)?

      Aside from that, I like that this exists now.

      • davitb an hour ago
        This is listener-side, not speaker-side: it changes what you hear, not what you send, so the scammer misuse case doesn't apply.
  • CyberSec86888 3 hours ago
    This tackles a massive yet often overlooked gap in global communication.

    A majority of professionals around the world operate in English as a second language, yet most voice technology has historically been designed with native speech patterns in mind. That imbalance creates subtle barriers in everyday conversations, from team syncs to high-stakes business calls.

    Building real-time, on-device accent adaptation, without clean paired datasets, across countless speech variations, while separating pronunciation patterns from speaker identity and keeping latency ultra-low on CPU, is an extraordinary technical achievement.

    Deep respect for taking on something this fundamental to inclusion and clarity in the modern workplace.

  • Ani_Kh1 an hour ago
    Curious whether wav2vec-style embeddings played a role in your representation learning.
  • KarineS 3 hours ago
    Finally Krisp built it! I'll understand my users better in interviews, with less cognitive load and no "could you please repeat that" phrasing.
  • amartiro 3 hours ago
    The lack of parallel data is the real problem here: you can't crowdsource ground truth, because no one can record themselves with a different accent.
  • rasjonell 2 hours ago
    Latency can destroy conversational rhythm. What's your p95 inference time? Also, are there any benchmarks we can see?
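For context, p95 here means the 95th percentile of per-chunk processing times. A minimal sketch of how one might measure it (the `process_chunk` callable and chunk sizes are hypothetical, not Krisp's API):

```python
import time
import statistics

def p95_latency(process_chunk, chunks):
    """Time each streaming step and report the 95th-percentile latency in ms.

    `process_chunk` stands in for any per-chunk inference call.
    """
    timings = []
    for chunk in chunks:
        start = time.perf_counter()
        process_chunk(chunk)
        timings.append((time.perf_counter() - start) * 1000.0)  # ms
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
    return statistics.quantiles(timings, n=20)[18]

# Toy usage: a no-op "model" over 200 fake 20 ms audio chunks (16 kHz, 16-bit)
chunks = [b"\x00" * 640 for _ in range(200)]
print(f"p95: {p95_latency(lambda c: None, chunks):.3f} ms")
```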
  • lu_mn 3 hours ago
    Kinda wild to think accent friction is basically a tech problem. Doing this in real time on CPU sounds tough. Curious how well it holds up in messy, real calls.
  • armsuro 3 hours ago
    This feels adjacent to voice conversion research, but with stricter latency constraints.
  • tritont 2 hours ago
    Nice to finally see this direction of accent conversion (that is, on incoming calls) in the Krisp app. This is a very meaningful feature.
  • nareksardaryann 3 hours ago
    Great work. Natural + clear is the combo that matters.
  • sssnowgirl 3 hours ago
    This is a game-changer! I remember every call I had with an investor, feeling shy about asking "can you repeat?"... thanks Krisp, you changed my life!!
  • arshakarap 3 hours ago
    This is built for international, privacy-first teams!
  • bebelovejan 3 hours ago
    I'd like to use such a model, but only if it really preserves my voice; otherwise people would realize it's not me, or I'd have to use it all the time.
  • Tatevik_H 2 hours ago
    Streaming constraint under 200ms changes everything. Causal modeling in speech is brutal to get right.
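To illustrate that constraint, a minimal NumPy sketch (not Krisp's model): in a causal convolution the output at time t may only depend on samples up to t, which is what makes streaming possible but modeling harder.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D causal convolution: y[t] depends only on x[t-k+1..t].

    Left-pads the signal with zeros so the output never peeks at
    future samples -- the core constraint of streaming speech models.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])

# y[0] uses only x[0]; a non-causal "same" convolution would also see x[1].
x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.5, 0.5])
print(causal_conv1d(x, kernel))  # -> [0.5 1.5 2.5 3.5]
```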
  • gyumjibashyan 3 hours ago
    How did you estimate the number of IQ points?
  • aharutyunyan 3 hours ago
    Accent space is effectively infinite. Generalization must rely on invariants rather than enumeration.
  • armb21 2 hours ago
    This works weirdly well — I’m honestly amazed by how good and fast it is!
  • astipili 3 hours ago
    Will it help the barista at Starbucks finally get my name right?
  • Hripsimeh 3 hours ago
    This is a huge game changer!