Could Sarvam 30B/105B Models Be India's Answer to DeepSeek and Mistral? (shivekkhurana.com)

3 points by shivekkhurana 6 hours ago | 2 comments

alephnerd 5 hours ago
This aligns with what I've been thinking and discussing with my peers: technical documentation would be useful for benchmarking performance globally, but I have heard murmurs of it already being used for voice-generation use cases by a WITCH company.
- shivekkhurana 5 hours ago
  The TTS/STT models are actually good and aggressively priced. I personally built a voice-mode AI assistant.
  STT time to first token is ~300ms. ~20 seconds of audio takes less than 1 second to transcribe.
  TTS time to first token is ~700ms. ~20 seconds of audio is generated in under 2 seconds.
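For anyone who wants to compare these numbers against other providers, one common way to normalize them is the real-time factor (RTF): processing time divided by audio duration. The sketch below is only a back-of-the-envelope calculation using the approximate, self-reported figures above as upper bounds; the function name `real_time_factor` is made up for illustration and is not part of any Sarvam API.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Upper bounds taken from the figures quoted above (approximate, self-reported).
stt_rtf = real_time_factor(1.0, 20.0)  # ~20 s of audio transcribed in under 1 s
tts_rtf = real_time_factor(2.0, 20.0)  # ~20 s of audio generated in under 2 s

print(f"STT RTF < {stt_rtf:.2f}, TTS RTF < {tts_rtf:.2f}")
```

Under those assumptions both directions run well faster than real time, so for a voice assistant the time to first token (~300ms and ~700ms) is likely what the user actually notices.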
  - alephnerd 5 hours ago
    Absolutely! The TTS/STT approach that Sarvam and the other Indian firms are taking is more intuitive for a larger share of people and use cases. The "replace an SDR" or "replace a call center" use case is such an easy win to show POV.
    I feel this is also why you don't see the same degree of hype as you would with the other players. When you are taking an application-driven approach to launching AI products, hype matters less than targeting decision-makers and showing that your product directly aligns with their outcomes.
    porridgeraisin 3 hours ago
    One other reason STT and OCR are the focus (check out the Sarvam Vision demo on their website, it's extremely good!) is that they can be used to build Indian-language datasets, which can then be used to train LLMs larger than the current 105B one. Most training data in Indian languages (as you'd know, there are more than just Hindi) is either in speech form or in old books.
    If you add in the commercial aspect you pointed out, TTS/STT becomes even more important.