3 points by shivekkhurana 6 hours ago | 2 comments
  • alephnerd 5 hours ago
    This aligns with what I've been thinking and chatting with my peers about: technical documentation would be useful for benchmarking performance globally, but I have heard murmurs of it already being used for voice-gen use cases by a WITCH company.
    • shivekkhurana 5 hours ago
      The TTS/STT models are actually good and aggressively priced. I personally built a voice-mode AI assistant.

      STT time to first token is ~300 ms. ~20 seconds of audio takes less than 1 second to transcribe.

      TTS time to first token is ~700 ms. ~20 seconds of audio is generated in under 2 seconds.
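
      (For reference, "time to first token" here is just the delay between sending the request and receiving the first transcript or audio chunk. Below is a minimal sketch of how you could measure it against a streaming STT endpoint; the URL, auth header, and response framing are hypothetical placeholders, not any particular provider's actual API.)

        import time
        import requests

        API_URL = "https://api.example.com/v1/stt/stream"  # hypothetical endpoint
        API_KEY = "YOUR_API_KEY"                            # hypothetical credential

        def measure_stt_latency(audio_path):
            # Read the audio clip to be transcribed (e.g. a ~20 s WAV file).
            with open(audio_path, "rb") as f:
                audio_bytes = f.read()

            start = time.monotonic()
            first_chunk_at = None

            # Stream the response so the arrival of the first transcript chunk
            # ("time to first token") can be timed separately from the total
            # transcription time.
            with requests.post(
                API_URL,
                headers={"Authorization": f"Bearer {API_KEY}"},
                data=audio_bytes,
                stream=True,
                timeout=30,
            ) as resp:
                resp.raise_for_status()
                for chunk in resp.iter_content(chunk_size=None):
                    if chunk and first_chunk_at is None:
                        first_chunk_at = time.monotonic()

            end = time.monotonic()
            print(f"time to first token: {(first_chunk_at - start) * 1000:.0f} ms")
            print(f"total transcription time: {end - start:.2f} s")

        measure_stt_latency("sample_20s.wav")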

      • alephnerd 5 hours ago
        Absolutely! The TTS/STT approach that Sarvam and the other Indian firms are taking is more intuitive for a larger share of people and use cases. The "replace an SDR" or "replace a call-center" use case is such an easy win for demonstrating proof of value.

        I feel this is also why you don't see the same degree of hype as you would with the other players. When you take an application-driven approach to launching AI products, hype matters less than targeting decision-makers and showing that your product directly aligns with their outcomes.

        • porridgeraisin 3 hours ago
          One other reason STT and OCR are the focus (check out the Sarvam Vision demo on their website, extremely good!) is to use them to build Indian-language datasets that can then be used to train larger LLMs than the current 105B one; a rough sketch of that kind of pipeline is below. Most training data in Indian languages (you'd know, there are more than just Hindi) is in either speech form or old books.

          If you add in the commercial aspect you pointed out, TTS/STT becomes even more important.
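
          A minimal sketch of that dataset-building idea, assuming a transcribe() helper backed by whatever STT model/API you have access to (the helper and its signature are placeholders, not Sarvam's actual SDK): transcribe an archive of recordings and write the results out as a line-delimited JSON corpus for pretraining.

            import json
            from pathlib import Path

            def transcribe(audio_path, language):
                """Placeholder: call your STT model/API and return the transcript text."""
                raise NotImplementedError

            def build_corpus(audio_dir, language, out_path):
                # Walk an archive of recordings and emit one JSON object per
                # transcript, the usual line-delimited format for pretraining data.
                with open(out_path, "w", encoding="utf-8") as f:
                    for audio_file in sorted(Path(audio_dir).glob("*.wav")):
                        text = transcribe(audio_file, language=language)
                        record = {"source": audio_file.name, "language": language, "text": text}
                        f.write(json.dumps(record, ensure_ascii=False) + "\n")

            build_corpus("speeches_ta/", language="ta", out_path="tamil_corpus.jsonl")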