    I have been working on a problem most language detection libraries quietly fail at: short, messy, conversational text. The kind you see in chat apps, support tickets, SMS, and mixed-language messages.

    FastLangML is my attempt to fix that.

    It is a multi-backend ensemble (FastText, Lingua, langdetect, pyCLD3, and others) with a voting layer built for real-world text (a rough sketch of the voting idea follows this list). It handles:

    - Short messages with almost no statistical signal
    - Code switching like Hinglish or Spanglish
    - Slang, abbreviations, and emojis
    - Multi-turn conversations where context matters
    - Confusable languages like ES vs PT, or NO vs DA vs SV
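
    To make the voting layer concrete, here is a simplified, illustrative sketch of weighted voting over several detectors. The backend names and weights are made up for the example and are not FastLangML's actual implementation:

        from collections import defaultdict

        # Simplified sketch of a weighted-voting layer over several detectors.
        # Backend names and weights are illustrative, not FastLangML's real logic.
        def vote(predictions, weights=None):
            """predictions: list of (backend_name, lang_code, confidence) tuples."""
            weights = weights or {}
            scores = defaultdict(float)
            for backend, lang, conf in predictions:
                scores[lang] += conf * weights.get(backend, 1.0)
            best = max(scores, key=scores.get)          # highest combined score wins
            total = sum(scores.values())
            return best, (scores[best] / total if total else 0.0)

        # Example: three backends disagree on a short Spanish/Portuguese message.
        preds = [("fasttext", "es", 0.55), ("lingua", "pt", 0.48), ("cld3", "es", 0.40)]
        print(vote(preds, weights={"fasttext": 1.2, "lingua": 1.0, "cld3": 0.8}))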

    A few design choices (a hypothetical usage sketch follows this list):

    - Context-aware detection so you can pass conversation history and get more stable predictions
    - A hinting system for slang, abbreviations, and custom rules
    - Extensible backends so you can plug in your own detectors or voting logic
    - Optional persistence using Redis or disk for multi-turn conversations
    - Support for more than 170 languages across the ensemble
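
    To picture how those choices fit together, here is a hypothetical usage sketch. The import path, class name, and keyword arguments (backends, persistence, hints, history) are assumptions for illustration, not FastLangML's documented API:

        # Hypothetical usage sketch; the names below are illustrative only,
        # not FastLangML's real API.
        from fastlangml import FastLangML  # assumed import path

        detector = FastLangML(
            backends=["fasttext", "lingua", "cld3"],   # assumed backend identifiers
            persistence="redis://localhost:6379/0",    # optional multi-turn state store
            hints={"mdr": "fr", "jaja": "es"},         # custom slang -> language rules
        )

        # Passing prior turns lets the ensemble lean on conversation context
        # when the current message ("ok") carries almost no signal on its own.
        history = ["salut, tu viens ce soir ?", "oui vers 20h"]
        result = detector.detect("ok", history=history)
        print(result.language, result.confidence)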

    Why I built it: most detectors are tuned for long, clean text. They break on "ok", "jaja", "mdr", "brooo", or anything with mixed languages. I needed something that works on real chat data, not idealized text.

    I would love feedback from HN on:

    - How you evaluate language detection quality in production
    - Whether context-aware detection helps in your workflows
    - Ideas for improving code switching accuracy
    - Additional backends worth integrating

    Repo: https://github.com/pnrajan/FastLangML

    Happy to share benchmarks, architecture notes, or design tradeoffs if people are interested.