Some reports, which I can find and post here if necessary, claim this can lead to a 40% or so overall performance difference.
There is also a view that, because of the way complex meanings are encoded in a pictograph-type language, it improves the inference stage and ultimately greatly reduces hallucinations.
There has been some work from Microsoft and others on compressing tokens on the user side. Other papers have suggested the advantages are so great that a new symbol-based language should be created for all of the training data.
Does anyone have any experience with this sort of LLM optimization? Are Mandarin and similar languages more efficient for LLMs?
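For anyone wanting to poke at the density claim themselves, here's a rough stdlib-only sketch. The sentence pairs are my own illustrative translations, and this only counts characters and UTF-8 bytes; real LLM efficiency depends on the tokenizer's vocabulary (a BPE tokenizer may split one Chinese character into several byte-level tokens), so treat this as a starting point, not a measurement:

```python
# Rough comparison of surface-level "density" for parallel sentences.
# Chinese uses far fewer characters, but each character costs 3 UTF-8 bytes,
# so byte counts (and hence byte-level token counts) are closer than they look.
pairs = [
    # (English, rough Chinese translation) -- illustrative examples only
    ("The weather is very nice today.", "今天天气很好。"),
    ("I would like a cup of coffee.", "我想要一杯咖啡。"),
]

for en, zh in pairs:
    print(f"EN: {len(en):3d} chars, {len(en.encode('utf-8')):3d} bytes")
    print(f"ZH: {len(zh):3d} chars, {len(zh.encode('utf-8')):3d} bytes")
```

To measure what actually matters, you'd run both sides through the specific model's tokenizer and compare token counts instead.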
IF that were true, it would be interesting, because Chinese is anything but the precise language that something like Z, Coq, or APL tries to be: words have remarkably fluid, highly contextualized meanings. The opportunity for a mis-walk through the information space seems higher, not lower.
Sometimes a cigar is just a cigar, but Honey and Winnie the Pooh have two clear meanings now in China. As does Draco Malfoy. I can't see how this helps an LLM.
(I'm an AI skeptic, and a complete outsider in this space)
Your examples aren’t language-specific, though. I doubt English lacks words whose meanings have been twisted in the same way.