482 points by thm 8 days ago | 24 comments
  • Arcuru7 days ago
    From the paper, I was intrigued by how they handled their RL step for Code Data. They trained against hard but solvable code generation tasks by running unit testing. Is that training step done by the other models?

    > Code Data For coding problems, we curate a high-quality training set comprising open-source datasets and our newly collected problem set. We remove problems without test cases. For problems with golden solutions, we exclude those where the golden solution failed to pass all test cases. For problems without golden solution, we discard problems where no test case can be solved in 16 rollouts of advanced reasoning models. Similar to math data, we utilize an SFT version of MiMo-7B to filter out easy problems that are perfectly solved in all 16 rollouts. This rigorous cleaning process yields 30K code problems.

    > During each RL iteration, we evaluate thousands of problems to compute the rewards, with each problem potentially containing hundreds of test cases. To improve reward computing efficiency and eliminate GPU idle time, we developed an online judge environment that enables parallel execution of extremely high-volume unit tests.
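The filtering rules in the first quoted paragraph can be read as a simple predicate over problems. The following is only a minimal sketch of that reading, not the authors' pipeline: the `Problem` fields and the `keep` function are hypothetical names, and the per-rollout pass counts are assumed to have been computed elsewhere (for example by a judge that runs the unit tests).

```python
from dataclasses import dataclass
from typing import List, Optional

ROLLOUTS = 16  # rollout count taken from the quoted passage


@dataclass
class Problem:
    name: str
    num_tests: int
    # None means the problem ships without a golden solution.
    golden_passes_all: Optional[bool]
    # Tests passed in each rollout of a strong reasoning model (hypothetical field).
    strong_passed: List[int]
    # Tests passed in each rollout of the SFT model (hypothetical field).
    sft_passed: List[int]


def keep(p: Problem) -> bool:
    """Return True if the problem survives the quoted cleaning rules."""
    # Rule 1: problems without test cases are removed.
    if p.num_tests == 0:
        return False
    if p.golden_passes_all is not None:
        # Rule 2: a golden solution must pass every test case.
        if not p.golden_passes_all:
            return False
    elif not any(n > 0 for n in p.strong_passed):
        # Rule 3: without a golden solution, at least one test case
        # must be solved in some rollout of the strong model.
        return False
    # Rule 4: drop problems the SFT model solves perfectly in all rollouts.
    if len(p.sft_passed) == ROLLOUTS and all(n == p.num_tests for n in p.sft_passed):
        return False
    return True
```

A problem that the SFT model solves in only some rollouts is kept, which matches the "hard but solvable" description above.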

    • loufe6 days ago
      Is any RL done without unit testing? I would be surprised if so, since it would imply that other model makers disregard accuracy. Perhaps you can do this for small modular problems but not for a problem with a 200k token input?
  • lvl1557 days ago
    Why are there so many English-first AI models from China? Are they not interested in serving their own population? Or is it that if they publish Chinese-first models it won't get publicity in the West?
    • throwup2387 days ago
      CommonCrawl [1] is the biggest and most easily accessible legally acquired crawling dataset around, collecting data since 2008. Pretty much everyone uses this as their base dataset for training foundation LLMs and since it's mostly English, all models perform well in English.

      [1] https://commoncrawl.org/

    • whynotmaybe7 days ago
      Haven't we reached a situation where English is the de facto language of scientific research, especially AI benchmarks?

      It's clearly impossible for me to try anything in Chinese, I'd need a translation.

      • xmichael9097 days ago
        Correct. Lingua franca for at least the last 75 years, if not longer.
        • numpad07 days ago
          For publishing results, yes, but not necessarily for the generation part of it.
      • unsupp0rted6 days ago
        Less and less, it feels like, every year. I wonder if anybody has hard numbers on that.
    • julianozen7 days ago
      One thing I thought was interesting about this paper [1] on understanding LLMs was how the models associate words/concepts in different languages with each other in what they call Multilingual Circuits.

      So the example they give:

      English: The opposite of "small" is " → big

      French: Le contraire de "petit" est " → grand

      Chinese: "小"的反义词是" → 大

      Cool graphic for the above [2]

      So while English is the lingua franca of the internet and represents the largest corpus of data, the primary models being built are able to use an English dataset to build associations across languages. This might create significantly stronger AI and reasoning even for languages and regions that lack the data, tech and resources to build local models.

      [1] https://www.anthropic.com/research/tracing-thoughts-language...

      [2] https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...

    • enlyth7 days ago
      I assume a large portion of high quality training material is in English
      • sigmoid107 days ago
        You'd be correct. The largest portion of all languages in Common Crawl (aka the "whole open internet" training corpus) is English with 43%. No other language even reaches double digit percentages. The next biggest one is Russian at 6%, followed by German at 5%.
        • Svoka7 days ago
            I wonder where you are getting your data. According to Wikipedia, Russian is #7: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet

            The only place where Russian is in the top 5 is Wikipedia views. The Russian part of the internet is steadily shrinking, as Russian imperialism crumbles.

          • div727 days ago
            > The largest portion of all languages in Common Crawl

            https://commoncrawl.github.io/cc-crawl-statistics/plots/lang...

            • Svoka7 days ago
              Thanks!

              I wonder where this discrepancy comes from

              • tough7 days ago
                probably under-indexing of non-english sources by these crawlers.

                would be interesting if yandex opened some data sets!

                • simongray7 days ago
                  And lots of people write on the web using English as a second language, which both reduces the presence of their native language and increases the presence of English.
                  • tough6 days ago
                    yep not a native english speaker here and yet my online footprint is mostly english due to software pushing me to learn it
              • numpad06 days ago
                My guess is that reference counting at depth=1 only captures non-$LANG content whose text parts don't matter much, e.g. photo galleries.
    • yyhhsj05217 days ago
      Chinese internet mostly consists of a few closed gardens tightly controlled by big corps. Crawlers simply don't work when each company employs an army of engineers to guard their data. Many of the most popular websites are also app only. It's impossible to get the corpus necessary to train a good LLM.
      • AlexCoventry7 days ago
        DeepSeek claims they had 12% more Chinese tokens than English, in their training corpus for DeepSeek V2, FWIW.

        https://arxiv.org/pdf/2405.04434#page=12

        > Our tokenized pretraining corpus contains 8.1T tokens, where Chinese tokens are approximately 12% more than English ones.

      • bredren7 days ago
        Do we have estimates on the corpus that is available? This model's repo describes "multiple strategies to generate massive diverse synthetic reasoning data." FWIW, AI 2027 forecasts heavy emphasis on synthetic data creation.

        Is the lack of existing corpus just an extra hurdle for Hanzi-first models that are also leading the pack in benchmarks?

    • chvid7 days ago
      All LLMs are trained on the same basic blob of data - mostly in English, mostly pirated books and stuff.
      • eru6 days ago
        That's wrong.

        Many LLMs are trained on synthetic data produced by other LLMs. (Indirectly, they may be trained on pirated books. Sure. But not directly.)

        • loufe6 days ago
          Likely the case for established model makers, but barring illegal use of outputs from other companies' models, a "first generation" model would still need this as a basis, no?
          • eru6 days ago
            Why illegal? The more open models (or at least open-weight models) should allow using their outputs. Details depend on license.

            But yes, 'first generation' models would be trained on human text almost by definition. My comment was only to contradict the claim that 'all LLMs' are trained from stolen text, by noting that some LLMs aren't trained (directly) on human text at all.

    • Barrin927 days ago
      >Or is it that if they publish Chinese-first models it won't get publicity in the West?

      This is a large part of it. Kai-Fu Lee's company (https://www.01.ai/) has been publishing open source Chinese language/market focused models pretty early, but the entire conversation around Chinese tech just isn't available to you if you don't speak Chinese, in particular these days given that good English language reporting on the Chinese tech sector just seems very scarce.

    • Leary7 days ago
      They are not "English-first". Deepseek-R1, for example, reasons in Chinese when you ask it a question in Chinese.
      • eru6 days ago
        I've seen one of the ChatGPT models produce the occasional Chinese phrase even when otherwise reasoning in English about a problem given in English.
      • HPsquared6 days ago
        Does that apply in other languages too, like French?
    • choutianxius7 days ago
      One reason is that there is no "good" search engine in China. The most popular one, Baidu, is like garbage compared to Google search. The most useful training data in Chinese would likely be from the social media and video sharing platforms, which I guess is much more difficult to crawl and clean up.
      • thoroughburro7 days ago
        A few thousand years of literature ain’t nothing…
        • kccqzy7 days ago
          Peanuts compared to the discourse available on the internet.

            The literature that survived thousands of years is the cream of the crop; you won't find lots of random unimportant dialog between people from thousands of years ago, but you do find that on Reddit.

        • fwipsy7 days ago
          Given premodern population sizes and literacy rates, historical texts probably don't exist in anything like the quantity that internet posts do. Even if they did, the information may not be relevant to the modern world.
      • littlestymaar7 days ago
        > The most popular one, Baidu, is like garbage compared to Google search

        It must be very bad when you see the walking turd that Google search has become over the years…

        • sidibe7 days ago
          It is. In Chinese-speaking countries where Google is available, no one is using Baidu.
          • hnfong6 days ago
              There's only ONE* Chinese-speaking country, at least if you only count those that have a Chinese-speaking majority population, or use Chinese as the official language.

            * for various interpretations of one.

            • ii414 days ago
                Chinese is one of the official languages of Singapore.
          • saagarjha7 days ago
            Do any of those countries have a good relationship with China and/or countries from there?
            • eru6 days ago
              Singapore has a pretty good relationship with China (with all Chinas, actually). And we have plenty of Chinese speakers, too. I'm not sure how prevalent Baidu is, however.
    • Havoc7 days ago
      I was under the impression that we just see the English stuff given that we're using English news channels.
    • paulsutter7 days ago
      I don’t see any indication that it’s English-first?
    • lwansbrough7 days ago
      I’m going to go with: to ensure it is not disadvantaged in benchmarks
    • spacebanana77 days ago
      I wonder whether English text having fewer characters provides an advantage somehow.
      • jmole7 days ago
        not really, since tokenization combines multiple characters
    • revskill7 days ago
      Chinese is hard.
    • overfeed7 days ago
      Why are so many American models multi-lingual, supporting hundreds of languages not commonly spoken in the United States?

      Could it be that being multilingual results in a larger pool of human knowledge on the technical side, compared to training on just one or two languages? And on the business side, supporting more languages results in a larger TAM (total addressable market). Using an English-language dataset for training LLMs is the default, not the other way around like you insinuate.

      • achierius7 days ago
        That's clearly a different question. It'd be possible for these models to be Mandarin-first while still supporting other languages, like American models are English-first while doing the same, but that's not what's happening.
        • overfeed7 days ago
          > That's clearly a different question. It'd be possible for these models to be Mandarin-first while still supporting other languages

          What would a hypothetical "Mandarin-first" model look like to you?

          I challenge the notion that the current models are "English-first" - that is an unsubstantiated opinion not supported by fact. I bet, dollars to donuts, these models are SoTA in Mandarin as well. When framed that way, asking "Why are they marketed as English-speaking models outside of China" or "Why are they really good at English" are simply not interesting questions - they have obvious answers.

          • yorwba7 days ago
            > What would a hypothetical "Mandarin-first" model look like to you?

            Given a language-agnostic prompt like "12 + 89", any explanatory text it outputs could be expected to be in Mandarin most of the time.

            According to this test, Xiaomi's MiMo-7B-RL is an English-first model.
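The test described here needs some way to decide which language the explanatory text is in. A crude way to approximate that is to compare how many characters fall in the main CJK ideograph block against how many are ASCII letters. This is only a sketch of such a heuristic: `dominant_script` is a hypothetical helper, it cannot tell Mandarin from Japanese kanji or English from French, and the sample strings are made up rather than actual MiMo-7B-RL output.

```python
def dominant_script(text: str) -> str:
    """Classify text as 'cjk', 'latin', or 'neutral' by counting characters."""
    # CJK Unified Ideographs, the block holding most common Chinese characters.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    # ASCII letters only; digits, operators and spaces count for neither side.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if cjk == 0 and latin == 0:
        return "neutral"
    return "cjk" if cjk >= latin else "latin"


print(dominant_script("12 + 89"))                # the prompt itself is neutral
print(dominant_script("12 加 89 等于 101"))       # Chinese explanation
print(dominant_script("12 plus 89 equals 101"))  # English explanation
```

Run over many sampled completions of the same bare prompt, the share of "cjk" versus "latin" results would give a rough number for the "most of the time" criterion.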

            • overfeed6 days ago
                "12 + 89" uses the Latin alphabet and is in no way language-agnostic in this context. I expect borrowed constructs to appear relatively more frequently in the language they were borrowed from.

                Now I'm curious how Mistral models would respond to "language-agnostic" phrases like "Rendezvous" or "coup d'etat".

              • yorwba6 days ago
                You may think of these symbols as "Latin" because they're how people writing in Latin script happen to write mathematical expressions, but the exact same symbols are also used by Mandarin speakers, as well as in numerous other scripts. Writing math in Chinese characters is literally as uncommon as someone writing "twelve plus eighty-nine" in English.

                In contrast, your examples would be spelled « rendez-vous » and « coup d’État » in French, i.e. easily distinguishable from their English descendants.

                • overfeeda day ago
                  > You may think of these symbols as "Latin" because they're how people writing in Latin script happen to write mathematical expressions

                  No need for scare-quotes, Latin script is a proper noun and a technical term with precise meaning wrt text encoding - not "what I think."

                  > the exact same symbols are also used by Mandarin speakers, as well as in numerous other scripts. Writing math in Chinese

                  Which unicode code points do the Mandarin speakers and "numerous other scripts" use to write "12 + 89"? Could it be the very same code points as Latin script, which then are tokenized to the same vectors that the LLMs learn to associate more with English text rather than CJK in the latent space?

                  > i.e. easily distinguishable from their English descendants.

                  You're making broad assumptions about the tokenization design here that do not apply universally.

                  • yorwba8 hours ago
                    Precisely because the exact same codepoints are used for digits and mathematical symbols, there's nothing script-specific about them and their linguistic association is determined by the training data mixture. A model trained predominantly on text scraped from Chinese websites would learn to associate them more with Mandarin than English in the latent space, since that would be the context where they most often appear.
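The code point question can be checked directly with Python's standard `unicodedata` module. The sketch below only inspects code points and character names; it says nothing about how any particular tokenizer or model treats them. For context, the Unicode standard assigns ASCII digits and "+" the script property Common rather than Latin, although `unicodedata` does not expose that property.

```python
import unicodedata


def describe(text):
    """List (code point, Unicode name) for every non-space character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in text
        if not ch.isspace()
    ]


# The arithmetic prompt: ASCII digits and a plus sign.
for code, name in describe("12 + 89"):
    print(code, name)

# For comparison: a Latin letter and the Chinese numeral for ten.
print(describe("a"))
print(describe("十"))
```

The digit and operator names carry no script label, while "a" is named as a Latin letter and "十" as a CJK ideograph, so any language association for "12 + 89" has to come from the surrounding training text.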
    • mensetmanusman7 days ago
      English won. Chinese youth now struggle to write their own calligraphy characters, even ones they can read. Typing favors English.
      • rahimnathwani7 days ago
        It's easy and fast to type Chinese sentences using a keyboard.
      • -__---____-ZXyw7 days ago
        Source?

        This smacks of "I saw a headline once"-itis. Especially the fact that you refer to the Chinese characters as "calligraphy characters", as if that were the general term or something.

        • Jarwain7 days ago
          These are probably the headlines they're thinking about,

          https://www.globaltimes.cn/content/747853.shtml

          https://www.bbc.com/news/blogs-china-blog-28599392

          Or more recently this one about character amnesia

          https://globalchinapulse.net/character-amnesia-in-china/

          None of these really mean that English has won, though. Rather that phonetics-based writing systems are easier to remember and use, especially in conjunction with digital systems that make it easy to map sound and context to symbols.

          I wouldn't be surprised if characters are faster to read though. In English we have all these subconscious shortcuts like looking at the shape of the word, first and last letters, etc. But I think symbology can convey more at a glance. Thus the popularity of emoji

          • -__---____-ZXyw7 days ago
            Ah no, I know myself that there have been headlines here and there.

            I'm pretty sure there was some controversy in the linguistic blogging community at some stage over the last couple of years, with someone writing an essay claiming the Chinese character system was in some sense less advanced and maybe on the way out, and this leading to a serious response or two, the usual fiery academic affair. I can't locate it this instant though.

            I moreso meant for OP's low-effort dramatisation to not go unanswered. Framing it as "winning" some sort of language battle is particularly silly.

            Your musings are interesting though, and the topic certainly is a fascinating one. Languages that use morphemes for writing are wild. Symbology is a cool word also - surely there has to be a lisp blog somewhere with that word in the title.

      • throwaway5197 days ago
          The pendulum already turned back. The current generation under 20 grew up with touchscreens. That obsoletes input with pinyin; many don't care if the device has no keyboard.
        • thenthenthen7 days ago
          Input is so interesting in China, basically a sorta t9 but just single letters and picking the right characters, with common/frequently used characters first, using pinyin. For example to say “ How are you?” You just type “nhm” (Ni Hao Ma) and 你好吗 shows up as suggestion/autofill. You can make surprisingly long sentences using this method.
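The "nhm" example can be sketched as a lookup from pinyin initials to candidate phrases, ranked by how common each phrase is. This is a toy illustration, not how any real input method is implemented: the phrase table, the frequency numbers, and the function names are all made up, and real input methods also handle two-letter initials such as "zh", "ch" and "sh", which this sketch ignores.

```python
# Toy phrase table: phrase -> (pinyin syllables, made-up frequency score).
PHRASES = {
    "你好吗": (["ni", "hao", "ma"], 100),
    "你很忙": (["ni", "hen", "mang"], 20),
    "你好": (["ni", "hao"], 500),
}


def initials(syllables):
    """Abbreviate a pinyin reading to the first letter of each syllable."""
    return "".join(s[0] for s in syllables)


def candidates(abbrev):
    """Return phrases whose initials match, most frequent first."""
    matches = [
        (freq, phrase)
        for phrase, (syllables, freq) in PHRASES.items()
        if initials(syllables) == abbrev
    ]
    return [phrase for freq, phrase in sorted(matches, reverse=True)]


print(candidates("nhm"))  # the more frequent phrase is suggested first
print(candidates("nh"))
```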
        • olalonde7 days ago
            > That obsoletes input with pinyin

          Uh? Pinyin input is by far the most popular input technique in China. I rarely see anyone using handwriting input.

          That being said, it has nothing to do with English winning. It's just a Chinese input technique that uses the latin alphabet. English fluency in China is not very common, especially spoken English.

          • syndeo6 days ago
            My father-in-law here in China uses handwriting input, but everyone else I've seen here uses Pinyin, and it's totally fast and natural for them.

            And very true about the English. With some exceptions (of course), folks here maybe know a handful of words at best, and even then, pronunciation is usually pretty rough. People here really aren't using it; they are perfectly comfortable with their Chinese, and why wouldn't they be?

            Anyone saying otherwise clearly hasn't been here to see it firsthand.

          • eru6 days ago
            Just like German is written with almost the same alphabet as English, but that doesn't give you English fluency.
        • anticensor7 days ago
          If only Unicode decomposed Chinese characters on a per basic stroke basis: it would be so much easier to have keyboards following that.
          • ezst7 days ago
            This has nothing to do with Unicode and everything to do with input methods, of which there are a variety. Some methods are indeed shape-based like you suggest: https://en.wikipedia.org/wiki/Chinese_input_method#Shape-bas...

            By the looks of it, Pinyin (a phonetic one) won by a landslide, which I suspect is the result of a long effort by the Chinese government to install Mandarin as the official language of China, above regional dialects (different regions would write similar characters but pronounce them differently, and defaulting to Pinyin has this "nice" effect of having people "think of how it would be pronounced in Mandarin first", even when the result is characters that would be read by a Cantonese speaker).

        • pertymcpert7 days ago
          What? Only people I've seen use the writing input mode was old people.
    • 346797 days ago
      Nearly everyone in the urban areas of China spoke some English when I visited way back in 1995. It's a bilingual society.
      • crazygringo7 days ago
        This is not true. I was in Beijing around then and never met a single person who spoke English if they hadn't learned it for professional reasons (they worked in tourism, international business, etc.).

        It could not have been further from a bilingual society.

      • gcy7 days ago
        I suppose you were probably visiting some university districts/CBDs where people are likely to have received higher education. Elsewhere, aside from basic "hello"/"how are you", locals in general are not able to communicate in English.
      • rahimnathwani7 days ago
        I lived in Beijing and Shanghai for 9 years (2010-2019) and this is NOT my impression at all.
      • syndeo6 days ago
        Not sure which part you were in, but this is just not true in my experience. I've been to Beijing, Shenzhen, Guangzhou, and some others, and Mandarin really is a must if you want to even have a chance of communicating. I can't imagine how I'd function here if I only had English.

        I've not yet been to Shanghai, and while I would expect the English-speaking percentage to be a bit higher, it would still likely only be in the single-digits by my estimation.

    • bilbo0s7 days ago
      The mandarin language models obviously exist, but what would you do with them if they provided access to them? And what knowledge would be in them? What is the body of knowledge encoded in Mandarin? What does that look like?

      Sad reality is that not many outside of China have the facility with Mandarin to use those models. Even non-native Mandarin speakers who claim to be "fluent", are often messing up intended meaning in text. Or making literal translations that wind up making no sense.

      Inside of China, llm use will be Mandarin based. Outside, it seems to me English is the natural choice.

      Irony of ironies: probably the best way for a non-Mandarin-speaking layman to test a Mandarin-based model would be to use another LLM to translate prompts to Mandarin.

      It's a sad future we're looking at.

      Or a brilliant one.

      Time will tell.

      • johnla7 days ago
        For it to be brilliant, AI needs to be a benevolent tool all the time. It would take just a few malignant actors to turn our world upside down. I suspect it'll follow the same path as the Internet and social media: great at first, grow markets, bring us together, and then take a turn.
        • horacemorace7 days ago
          You’re right of course. That’s why these open source / weight releases are so critically important.
      • eru6 days ago
        > Even non-native Mandarin speakers who claim to be "fluent", are often messing up intended meaning in text. Or making literal translations that wind up making no sense.

        Happens with English as well, but non-native speakers of English still benefit from these models.

  • siliconc0w7 days ago
    This is incredibly strong coding performance for a 7b. I use Gemini Pro 2.5 which got 67.8 and this got 57.8, very close to Gemini 2.5 Flash which got 60.6.

    I've become pretty skeptical about eval results given what we've heard about llama4 so we'll see where this lands on the closed evals but very impressive to see.

  • jedisct17 days ago
    GGUF version (for LM Studio, Ollama, etc): https://huggingface.co/jedisct1/MiMo-7B-RL-GGUF
  • rahimnathwani7 days ago
    When you guys use gguf files in ollama, do you normally create a modelfile to go with it, or just hope that whatever default ollama has works with the new model?

    https://github.com/ollama/ollama/blob/main/docs%2Fmodelfile....

    • Havoc7 days ago
      One of the core design goals Georgi Gerganov had with GGUF was to not need other files. It's literally bullet point #1 in the specs

      >Single-file deployment

      >Full information: all information needed to load a model is contained in the model file, and no additional information needs to be provided by the user.

      https://github.com/ggml-org/ggml/blob/master/docs/gguf.md

      We literally just got rid of that multi file chaos only for ollama to add it back :/

      • rahimnathwani7 days ago
        Most of the parameters you would include in ollama's ModelFile are things you would pass to llama.cpp using command line flags:

        https://github.com/ggml-org/llama.cpp/blob/master/examples/m...

        If you only ever have one set of configuration parameters per model (same temp, top_p, system prompt...), then I guess you can put them in a gguf file (as the format is extensible).

        But what if you want two different sets? You still need to keep them somewhere. That could be a shell script for llama.cpp, or a ModelFile for ollama.

        (Assuming you don't want to create a new (massive) gguf file for each permutation of parameters.)
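One way to keep several parameter sets next to a single gguf file is a small table of named presets that gets turned into command-line arguments. This is a sketch under assumptions: the preset names and values are invented, the binary name `llama-cli` is assumed (older llama.cpp builds shipped the example as `main`), and only the `-m`, `--temp` and `--top-p` flags are used.

```python
# Hypothetical presets; the values are illustrative, not recommendations.
PRESETS = {
    "precise": {"temp": 0.2, "top_p": 0.9},
    "creative": {"temp": 1.0, "top_p": 0.95},
}

# Sampling keys mapped to llama.cpp command-line flags.
FLAGS = {"temp": "--temp", "top_p": "--top-p"}


def llama_args(model_path, preset_name, binary="llama-cli"):
    """Build an argument list for one preset without touching the gguf file."""
    args = [binary, "-m", model_path]
    for key, value in PRESETS[preset_name].items():
        args += [FLAGS[key], str(value)]
    return args


print(" ".join(llama_args("MiMo-7B-RL.gguf", "precise")))
print(" ".join(llama_args("MiMo-7B-RL.gguf", "creative")))
```

The same table could just as well be rendered into one ollama Modelfile per preset; the point is only that the presets live outside the model file.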

        • novaRom7 days ago
          This is why we use xdelta3, rdiff, and git
    • monkmartinez7 days ago
      If you ollama pull <model> the modelfile will be downloaded along with the blob. To modify the model permanently, you can copypasta the modelfile into a text editor and then create a new model from the old modelfile with the changes you require/made.

      Here is my workflow when using Open WebUI:

      1. ollama show qwen3:30b-a3b-q8_0 --modelfile

      2. Paste the contents of the modelfile into -> admin -> models -> OpenwebUI and rename qwen3:30b-a3b-q8_0-monkversion-1

      3. Change parameters like num_gpu 90 to change layers... etc.

      4. Keep | Delete old file

      Pay attention to the modelfile, it will show you something like this:

        # To build a new Modelfile based on this, replace FROM with:
        # FROM qwen3:30b-a3b-q8_0

      and you need to make sure the paths are correct. I store my models on a large NVMe drive that isn't the ollama default, as an example of why that matters.

      EDIT TO ADD: The 'modelfile' workflow is a pain in the booty. It's a dogwater pattern and I hate it. Some of these models are 30 to 60GB and copying the entire thing to change one parameter is just dumb.

      However, ollama does a lot of things right and it makes it easy to get up and running. VLLM, SGLang, Mistral.rs and even llama.cpp require a lot more work to setup.

      • rahimnathwani7 days ago
        Sorry, I should have been clearer.

        I meant when you download a gguf file from huggingface, instead of using a model from ollama's library.

        • monkmartinez7 days ago
          ollama pull hf.co/unsloth/Qwen3-30B-A3B-GGUF:Q4_K_M and the modelfile comes with it. It may have errors in the template or parameters this way. It has to be converted to GGUF/GGML prior to using it this way. You can, of course, convert and create the specific ollama model from bf16 safetensors as well.
          • rahimnathwani7 days ago
            Yeah when I do this, the modelfile has only FROM and TEMPLATE. No PARAMETERs:

              ollama pull hf.co/jedisct1/MiMo-7B-RL-GGUF:Q4_K_M
              ollama show --modelfile hf.co/jedisct1/MiMo-7B-RL-GGUF:Q4_K_M
      • o11c7 days ago
        Pretty sure the whole reason Ollama uses raw hashes everywhere is to avoid copying the whole NN gigabytes every time.
        • monkmartinez7 days ago
          Maybe I am doing something wrong! When I change parameters on the modelfile, the whole thing is copied. You can't just edit the file as far as I know, you have to create another 38GB monster to change num_ctx to a reasonable number.
          • o11c7 days ago
            The parameters (prompt, etc.) should be set only in the new modelfile (passed to `ollama create`), using a FROM referencing the previous ollama model. Parameters in a Modelfile override the hard-coded parameters from the GGUF itself (which are sometimes buggy; in fact, from elsewhere in the thread, it sounds like MiMo is missing proper stop tokens, or maybe templates in general; I'm not an expert).

            This will show a separate entry in `ollama list` but only copy the Modelfile not the GGUF.

            Alternatively, if you use the API, you can override parameters "temporarily". Some UIs let you do this easily, at least for common parameters.
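The layering described in this comment amounts to a Modelfile that contains little more than a FROM line and the overrides. The helper below is a minimal sketch that only renders the Modelfile text; `render_modelfile` is a hypothetical name, the `num_ctx` value is arbitrary, and the base model tag is the one pulled earlier in the thread.

```python
def render_modelfile(base_model, params):
    """Render a Modelfile that layers parameter overrides on an existing model."""
    # FROM points at a model ollama already has, so the weights are reused.
    lines = [f"FROM {base_model}"]
    # Each override becomes one PARAMETER line; sorted for stable output.
    for key, value in sorted(params.items()):
        lines.append(f"PARAMETER {key} {value}")
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "hf.co/jedisct1/MiMo-7B-RL-GGUF:Q4_K_M",
    {"num_ctx": 8192},
)
print(text)
```

Saving the output as `Modelfile` and running `ollama create mimo-8k -f Modelfile` (the name `mimo-8k` is made up) should then add the new entry to `ollama list`.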

    • memhole7 days ago
      I’ll typically use the defaults initially and then use a Modelfile if it’s something I plan on using. I think you can dump the modelfile ollama uses to have a template to work with.
  • gizmodo597 days ago
    It's funny to see benchmarks where they omit the top-performing models like O3 (which is the best model in many benchmarks currently) and Gemini Pro/Claude 3.7.
    • daveguy7 days ago
      Those are much much larger models, and they are proprietary. Those model providers just don't have the distilled versions identified and available.

      Notice most of the models they are comparing with are 7B models. The exception is also an open weights model (Qwen-2.5-32B-RL-Zero). Even with 32B parameters the MiMo-7B outperforms it.

    • erikig6 days ago
      I believe the goal here is to compare them against similar models that are being optimized to run offline or on mobile hardware.
  • badmonster7 days ago
    MiMo-7B claims to outperform larger models like Qwen-32B and match OpenAI o1-mini on math/code benchmarks — all with a 7B model trained from scratch. Is this a sign that pretraining + RLHF optimization is finally outpacing scale? Or are we just getting better at benchmarking narrow capabilities?
    • loufe6 days ago
      Qwen 3 or 2.5?
  • xpe7 days ago
    The README says "RL" without specifying what kind of RL is used. Researchers: I know you are busy, and I know good writing takes time, but please don't skip this kind of detail.
    • ainch3 days ago
      The technical report does go into a lot of depth about how they use RL, such as the modified GRPO objective they use. As for the README, I imagine most people active in the field understand the implications of "RL" for a reasoning model.
    • paulluuk6 days ago
      I assume they mean "Reinforcement Learning", and it's been a decade since I studied AI in university, but isn't it perfectly valid to just say "RL"? What kind of specificity are you looking for, whether they used Q-Learning or some other algorithm?
      • xpe5 days ago
        I wouldn’t phrase it as a matter of “validity”. I would phrase it as a question of transparency.

        Putting a model out in public without clearly explaining how it works doesn’t meet my bar for a proper scientific exchange of knowledge. Perhaps they are being intentionally vague for competitive reasons.

        RL is a generic term that can be mixed and matched with various other methods. In the context of LLMs, often some variation of RLHF is used.

        But the authors don’t even say “RLHF”, much less explain their methodology. Understanding this isn’t just a matter of academic interest; it has implications for understanding and using this work.

        I’m often concerned by the writing quality of ML/AI papers but this strikes me as particularly disappointing.

        It is increasingly important to have confidence that the creators of AI systems are thoughtful and thorough. I want to see their reasoning. I want to understand the trade-offs they make and why.

        • paulluuk2 days ago
          If you put it like that, I absolutely agree with you, except that I suppose I don't really consider this an exchange of knowledge but more like the release of an open-source project: the only thing they need to publish are instructions on how to use it. I don't think they’re really interested in anyone improving their model by themselves or reproducing the work. It would be amazing if they did, though!
  • Jotalea7 days ago
    I wonder if they will use this model for their AI assistant on their Xiaomi 15 series phones. They most likely will. I'm not really sure what to expect from it.
  • ramesh317 days ago
    These benchmark numbers cannot be real for a 7b model
    • strangescript7 days ago
      The smaller models have been creeping upward. They don't make headlines because they aren't leapfrogging the mainline models from the big companies, but they are all very capable.

      I loaded up a random 12B model on ollama the other day and couldn't believe how competent it seemed and how fast it was given the machine I was on. A year or so ago, that would not have been the case.

      • apples_oranges7 days ago
        exactly, it seems to validate my assumption from some time ago, that we will mostly use local models for everyday tasks.
        • pzo7 days ago
          yeah, especially since this simplifies e.g. building mobile apps for 3rd party developers - no extra cost, no need to set up a proxy server or monitor usage to detect abuse, no need to make a complicated per-usage subscription plan.

          We just need Google or Apple to provide their own equivalent of both Ollama and OpenRouter, so users either run inference for free with local models or BringYourOwnKey and pay for the tokens/electricity bill themselves. We then just charge a smaller fee for renting or buying our cars.

        • AustinDev7 days ago
          Not just local models but bespoke apps. The number of bespoke apps I've created shot up dramatically in the last 6 months. I use one to do my recipes/meal plan every week. I have one that goes through all my email addresses and summarizes everything daily. I just finished an intelligent planner / scheduler for my irrigation system that takes into account weather forecast and soil moisture levels. If something is annoying and there is no commercial solution or open-source solution that has the features I want I just make it now and it's fantastic.

          I've had friends/family ask to use some of them; I declined. I don't want to do support / feature requests.

          • the_pwner2247 days ago
            As someone who hasn't used AI for "real" app development (mainly just getting ChatGPT to generate small functions & scripts), do you have any recommendations on what tools or resources I should use to get started with this?
            • AustinDev6 days ago
              Cursor/Cline/Windsurf are my recommendations for clients. For models, stay away from Sonnet 3.7. I find it just lies to you. I'd rather you use a slightly less capable model like Sonnet 3.5, where I know it will just make mistakes that won't compile.

              I do my planning with a combination of Grok3 and higher power OpenAI models. Once I have a plan of what I want to build, I create an implemenation_plan.md with all the steps to build my solution (generated by the higher power models). I carefully review this plan and if it looks good, I throw it into agent mode and get to work.

        • jillesvangurp7 days ago
          Including figuring out which more expensive models to use when needed instead of doing that by default. Early LLMs were not great at reasoning and not great at using tools. And also not great at reproducing knowledge. Small models are too small to reliably reproduce knowledge but when trained properly they are decent enough for simple reasoning tasks. Like deciding whether to use a smarter/slower/more expensive model.
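
          A minimal sketch of that routing idea, assuming both models are just callables from prompt to text. The names `answer`, `fake_small`, `fake_large` and the EASY/HARD protocol are made up for illustration, not any particular library's API.

```python
from typing import Callable

# Assumption: a "model" is any function that maps a prompt to a reply.
Model = Callable[[str], str]

ROUTER_PROMPT = (
    "Answer with exactly one word, EASY or HARD: can a small model "
    "handle this request reliably?\n\nRequest: {request}"
)


def answer(request: str, small: Model, large: Model) -> str:
    """Let the cheap model decide whether to escalate to the expensive one."""
    verdict = small(ROUTER_PROMPT.format(request=request)).strip().upper()
    # Escalate unless the small model clearly says EASY, so a garbled
    # verdict falls back to the more capable model.
    chosen = small if verdict == "EASY" else large
    return chosen(request)


# Stub models so the sketch runs without any inference backend.
def fake_small(prompt: str) -> str:
    if prompt.startswith("Answer with exactly one word"):
        return "HARD" if "prove" in prompt.lower() else "EASY"
    return "small: " + prompt


def fake_large(prompt: str) -> str:
    return "large: " + prompt


print(answer("Fix this typo", fake_small, fake_large))
print(answer("Prove the lemma", fake_small, fake_large))
```

          The design choice here is that the routing call itself is a simple reasoning task, which is exactly the kind of thing a small model can be trusted with.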
        • mring336217 days ago
          strong agree

          my employer talks about spending 10s of millions on AI

          but, even at this early stage, my experiments indicate that the smaller, locally-run models are just fine for a lot of tech and business tasks

          this approach has definite privacy advantages and likely has cost advantages, vs pay-per-use LLM over API.

          • lyu072826 days ago
            I spend a lot of time working with smaller models, I often had to split the problem into smaller subtasks to make it give acceptable accuracy. With the big models in the cloud you can often get things working much faster, it seems like a tradeoff in engineering time. What was your experience?
        • wg07 days ago
          But who will keep them updated and what incentive would they have? That I can't imagine. Bit vague.
          • ebiester7 days ago
            Eventually? Microsoft and Copilot, and Apple and Siri - even if they have to outsource their model making. It will be a challenge to desktop Linux.
            • WorldPeas7 days ago
              I figure this will take the same shape as package distribution. If you have ever used a linux distribution you’ll always see a couple .edu domains serving you packages. Big tech might be able to have specialized models, but following the linux paradigm, it will likely have more cutting edge but temperamental models from university research
          • cruzcampo7 days ago
            Who keeps open source projects maintained and what incentive do they have?
            • jsheard7 days ago
              Most open source projects don't need the kinds of resources that ML development does. Access to huge GPU clusters is the obvious one, but it's easy to forget that the big players are also using huge amounts of soulcrushing human labor for data acquisition, cleaning, labeling and fine tuning, and begrudgingly paying for data they can't scrape. People coding in their free time won't get very far without that supporting infrastructure.

              I think ML is more akin to open source hardware, in the sense that even when there are people with the relevant skills willing to donate their time for free, the cost of actually realizing their ideas is still so high that it's rarely feasible to keep up with commercial projects.

              • cruzcampo7 days ago
                That's a fair point. I think GPU clusters are the big one, the rest sounds like a good fit for volunteer work.
                • wg07 days ago
                  Or sharing GPU compute. Crowd sourcing.
                  • cruzcampo7 days ago
                    Ooooh I can see a Seti@Home setup working
                    • jsheard7 days ago
                      Easier said than done, training is usually done on "big iron" GPUs which are a cut above any hardware that consumers have lying around, and the clusters run on multi-hundred-gigabit networks. Even if you scaled it down to run on gaming cards, and gathered enough volunteers, the low bandwidth and high latency of the internet would still be a problem.
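
                      Rough back-of-the-envelope numbers for the bandwidth part, assuming a 7B-parameter model, 16-bit gradients, a 100 Mbit/s home uplink and a 400 Gbit/s cluster link. All four figures are my own illustrative assumptions, and the sketch ignores latency, gradient compression and the details of the all-reduce algorithm.

```python
def sync_seconds(n_params: float, bytes_per_param: int, link_bits_per_s: float) -> float:
    """Time to push one full copy of the gradients through a single link."""
    payload_bits = n_params * bytes_per_param * 8
    return payload_bits / link_bits_per_s


N_PARAMS = 7e9        # assumed model size (7B parameters)
BYTES_PER_PARAM = 2   # assumed 16-bit gradients

home = sync_seconds(N_PARAMS, BYTES_PER_PARAM, 100e6)     # 100 Mbit/s uplink
cluster = sync_seconds(N_PARAMS, BYTES_PER_PARAM, 400e9)  # 400 Gbit/s link

print(f"home: {home:.0f} s, cluster: {cluster:.2f} s")
```

                      Under those assumptions a single naive gradient exchange takes about 19 minutes on the home connection versus roughly a quarter of a second in the cluster, and training needs a very large number of such exchanges.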
            • simiones7 days ago
              For the bigger open source projects, companies who use that code for making money. Such as Microsoft and Google and IBM (and many others) supporting Linux because they use it extensively. The same answer may end up applying to these models though - if they really become something that gets integrated into products and internal workflows, there will be a market for companies to collaborate on maintaining a good implementation rather than competing needlessly.
      • justlikereddit7 days ago
        Last time I did that I was also impressed, for a start.

        Problem was that of the top ten book recommendations only the first 3 existed, and the rest were a casually blended hallucination delivered in perfect English without skipping a beat.

        "You like magic? Try reading the Harlew Porthouse series by JRR Marrow, following the orphan magicians adventures in Hogwesteros"

        And the further towards the context limit it goes, the deeper this descent into creative derivative madness gets.

        It's entertaining but limited in usefulness.

        • omnimus7 days ago
          LLMs are not search engines…
          • Philpax7 days ago
            An interesting development to look forward to will be hooking them up to search engines. The proprietary models already do this, and the open equivalents are not far behind; the recent Qwen models are not as great at knowledge, but are some of the best at agentic functionality. Exciting times ahead!
            • hedgehog7 days ago
              If you use something like Open Web UI today the search integration works reasonably well.
          • achierius7 days ago
            Many tasks that one might want to give a model end up implicitly including search as a subtask. For example, "plan me a trip to Santiago" obviously requires the model to understand details about the real city of Santiago. Less obviously, "write me a Python script to do ..." requires they understand APIs, libraries, etc., the same things you might ask a search engine to pull up. The tasks which do not require a coherent + mostly-correct exterior-world-model are relatively few -- text processing (e.g. "proofread this") is a big one; calculation tasks fit, but LLMs are also bad at those.
          • justlikereddit7 days ago
            They are generalists, being search engines is a subset of that.
          • mirekrusin7 days ago
            Exactly, I think all those base models should have this nonsense weeded out of them: kardashian-like labyrinths of knowledge complexity that just make them dumber by taking up space and compute time. If you can google some nonsense news, it should stay in search engines for retrieval. Models should be good at using search tools, not at trying to replicate their results. They should start from logic, math, programming, physics and so on, similar to what the education system is supposed to equip you with. IMHO small models can give this speed advantage (faster to experiment, i.e. with parallel diverging results, ability to munch through more data, etc.). Stripped to this bare minimum they can likely be much smaller with impressive results, tunable, allow for huge context etc.
      • nickip7 days ago
        What model? I have been using api's mostly since ollama was too slow for me.
        • patates7 days ago
          I really like Gemma 3. Some quantized version of the 27B will be good enough for a lot of things. You can also take some abliterated version[0] with zero (like zero zero) guardrails and make it write you a very interesting crime story without having to deal with the infamous "sorry but I'm a friendly and safe model and cannot do that and also think about the children" response.

          [0]: https://huggingface.co/mlabonne/gemma-3-12b-it-abliterated

        • estsauver7 days ago
          Qwen3 and some of the smaller gemma's are pretty good and fast. I have a gist with my benchmark #'s here on my m4 pro max (with a whole ton of ram, but most small models will fit on a well spec'ed dev mac.)

          https://gist.github.com/estsauver/a70c929398479f3166f3d69bce...

      • djmips7 days ago
        Which model?
    • GaggiX7 days ago
      https://qwenlm.github.io/blog/qwen3/

      Go look at the benchmark numbers of qwen3-4B if you think these are unrealistic.

      • energy1237 days ago
        Also not "real" in the sense that the model developers most likely put the benchmarks into the training data.
    • bearjaws7 days ago
      My guess is that it is over fitted to the tests.
      • revel7 days ago
        They used RFT and there's only so many benchmarks out there, so I would be very surprised if they didn't train on the tests.
    • andrepd7 days ago
      Every LLM is basically being trained on benchmarks so "benchmark" as applied to LLMs is a pretty meaningless term.
    • mirekrusin7 days ago
      Today's best models will be worse models for the rest of your life.
    • otabdeveloper47 days ago
      LLM benchmarks are mostly bullshit right now. Wait a few years until the hype cycle returns to sanity.
      • xpe7 days ago
        >> These benchmark numbers cannot be real for a 7b model

        > LLM benchmarks are mostly bullshit right now. Wait a few years until the hype cycle returns to sanity.

        This could mean a lot of things. Can you be a bit more specific? It's one thing to say benchmarks are gamed. Another to say models end up being trained on the benchmark indirectly. Another to say the particular experimental setup during the benchmark is unclear. Another to say mapping a benchmark to a real use case is hard. Are you saying some/all of these claims?

        Have you plotted MiMo versus others? Another comment suggests smaller models are performing better than expected. Any comment on that?

        • otabdeveloper47 days ago
          All of these claims and more are true because of perverse incentives right now.

          Personally I use Qwen 2.5, works well enough for me. Qwen 3 is a dud.

  • vessenes7 days ago
    Umm wow. Great benchmarks. I’m looking forward to chatting with this one.

    A couple things stand out to me — first is that the 7B model is trained on 25T tokens(!). This is Meta-scale training; Llama 4 Maverick was trained on 22T or so. (Scout, the smaller model: 40T).

    Second, this is an interesting path to take - not a distilled model or an RL layer to get reasoning out of another model, but a from-scratch RL model with reasoning baked in; the claims seem to indicate you get a lot of extra efficiency per-parameter doing this.

    I don’t have experience with Xiaomi models, so I’m cautious about this one until I play with it, but it looks like a super viable local reasoning model from the stats.

  • Havoc7 days ago
    Been testing it a bit and overall pretty solid. The lengthy think times mean one waits quite a while though. Longer than much larger models like, say, the recent qwen moe

    That moe strikes me as the better overall tradeoff

  • userbinator7 days ago
    ...and searching for things related to multiple antennae just got harder.

    They could've called it Xiaomimo.

    • arghwhat7 days ago
      multiple-input, multiple-output was horribly generic to begin with. Terms like multipath propagation and spatial multiplexing will do just fine.
  • mobilio7 days ago
    Waiting for GGUF or MLX models.

    They will probably be released within a few hours.

  • m4r1k7 days ago
    My Chinese friend told me MiMo doesn’t have a meaning in Chinese (of course Mi 米 = rice). Anybody have a clue for what it stands for?
    • gandalfgreybeer7 days ago
      A lot of Xiaomi products have the prefix Mi. My initial guess is Mo is for model.

      Also related reference https://en.wikipedia.org/wiki/Xiaomi#Name_etymology

    • column7 days ago
      (Xiao)mi mo(del) ?
      • johanyc7 days ago
        Yeah i think so 小(xiao)_米(mi)模(mo)_型(xing)
    • SchemaLoad7 days ago
      Isn't that basically standard for names? What does Google mean? Even when the words in a name do have some meaning, they often don't apply. The meaning of the word brother doesn't really mean anything in the company Brother.
      • t-37 days ago
        It's standard for English names (although google is a real word - the number represented by a one followed by 100 zeroes), but not so much for Chinese - at least personal names seem to usually have a real meaning rather than just sounding nice, although I'm not sure about how much that extends to product names.
      • pests7 days ago
        Why did you choose Google when it's classic lore that they named it after a real word, "googol", and the current spelling is just a typo from the first investor check?
        • SchemaLoad7 days ago
          It still doesn't really mean anything. Knowing the lore behind the name doesn't let you understand it any more than just taking it as a random name.
          • pests5 days ago
            It's still based on a real word though.
    • echelon_musk7 days ago
      Rice Model?
      • est7 days ago
        Millet Model
    • nicman237 days ago
      probably μίμος (mime)
  • CodeCompost7 days ago
    Open Source or Open Weights?
    • NitpickLawyer7 days ago
      MIT - so open source
    • ilrwbwrkhv7 days ago
      At this point everybody will open source their models or weights. The only one which will not is open AI.
      • rvz7 days ago
        > The only one which will not is open AI.

        I think you meant Anthropic. OpenAI is "planning" to release an open weight model this year likely competing against the Llama models. [0]

        I have not seen an open weight AI model ever being released by Anthropic at all.

        [0] https://openai.com/open-model-feedback/

  • w4yai7 days ago
    Anyone tried it ?
    • benterix7 days ago
      Yes, not great, not terrible. I gave it my personal test (a coding task): it produced semi-decent quality code that produced a minor error, and after I pasted the error back it failed to solve it over multiple rounds. I believe in another 2-3 years we'll have quite usable small models.
    • Alifatisk7 days ago
      No, where can I try it? I saw a huggingface link but I wonder if they host it themselves somewhere, like how Alibaba does with Qwen chat.
      • yorwba7 days ago
        There is a HuggingFace space (probably not official) at: https://huggingface.co/spaces/orangewong/xiaomi-mimo-7b-rl You might have to wait a minute to get a response. Also, the space doesn't seem to have turn-taking implemented, so after giving the Assistant's response, it kept on generating the Human's next message and so on and so forth.
  • onefeduk217 days ago
    [dead]
  • 8ibzjj7 days ago
    [flagged]
  • good-luck865236 days ago
    [flagged]
  • keepamovin7 days ago
    [flagged]
    • fredwu7 days ago
      Not sure why it would be "funny" as this is literally why they named the company Xiaomi.

      Source (Chinese): https://finance.sina.cn/tech/2020-11-26/detail-iiznctke33979...

      • keepamovin7 days ago
        For me it's funny that all the products are called "Rice-something" that's funny, hahaha! :)
        • bojan7 days ago
          Not that different from "apple" something.
          • thijson7 days ago
            I was reading the Steve Jobs biography and thought it was interesting that the choice in the name "apple" came from him wanting something that came before Atari in the yellow pages, and also that he had spent time at a Hippie apple orchard in Oregon.

            I was reading a Jack Tramiel biography recently, and read that early on the two Steves sought to sell Apple to Commodore for under a million dollars.

          • ReptileMan7 days ago
            Not quite. Rice-something has been used for goods coming from East Asia, in both a derogatory and a non-derogatory manner depending on the quality of the goods. Like rice rockets - the Japanese ultra high performance sport bikes, for example
        • est7 days ago
          Nah, Xiaomi literally means Millet, which also prefixes Mi.
    • amazingamazing7 days ago
      just as funny as an Apple, for sure.
      • keepamovin7 days ago
        That's a good point hahahaha! :)
    • cruzcampo7 days ago
      Does Xiaomi literally mean Little Rice? That's what my very limited mandarin would suggest
      • keepamovin7 days ago
        That is what my literally also rather limited Chinese would suggest. haha

        But with many single characters in Chinese, a Chinese person will tell you, if you ask for what a single character means, something like, "Well it's not so easy to pin down the meaning of that one. Sometimes we use it like this, and sometimes like that."

        Sure, some characters have an easy meaning (for me, I think the rice in Mi is one of them!) but there's plenty where you cannot get a Chinese person to easily tell you what a single character means. I guess it's a little like, but not the same as, asking an English person to tell you, what any given "morpheme" (word part, like fac-) means. Hahaha. Not a perfect analogy tho! :)

        Here's this list of morphemes I found just now thinking about this: https://www.fldoe.org/core/fileparse.php/16294/urlt/morpheme...

        Seems like an incomplete list when you consider that English words are often composed of parts from ages past! :)

      • kzz1027 days ago
        Xiaomi can also mean millet. I think it's a reference to this Mao quote: https://en.wikipedia.org/wiki/Millet_plus_rifles?wprov=sfla1
        • keepamovin7 days ago
          Wow, that's interesting. I guess that's like a US company being called "MRE". We would view that like a veteran's owned and operated company. Interesting.

          And all the products would be "MRE-Phone", "MRE-Pod", hehehe :)

        • est7 days ago
          That's just a fun coincidence, but in reality LeiJun and 12 others from Kingsoft Corp founded Xiaomi after they had a bowl of millet gruel.

          https://www.scmp.com/abacus/tech/article/3028654/documentary...

          • kzz1027 days ago
            This is one of those things where everyone gets the reference, but it wouldn't be good to admit it publicly. This quote is known to almost everyone born in that area, and it's the first thing that comes to mind when you hear the name.
      • fenprace7 days ago
        Mandarin speaker here. Literally, yes. Xiaomi means 'little rice'. But in reality when people say xiaomi, they always refer to another kind of crop, foxtail millet (https://en.wikipedia.org/wiki/Foxtail_millet). It is a traditional food and still very common in China and other places in Asia.
      • os2warpman7 days ago
        小米

        little rice

        Yes.

        But it's more complicated than that.

      • gandalfgreybeer7 days ago
        Etymology of the brand name here: https://en.wikipedia.org/wiki/Xiaomi#Name_etymology
      • iszomer7 days ago
        Yes.
    • cynicalsecurity7 days ago
      Mi has probably a bunch of meanings depending on tone?
  • shihabkhanbd7 days ago
    [flagged]
  • xmorse7 days ago
    Xiaomi is an amazing company
  • sida7 days ago
    Xiaomi in Chinese translates to "Little Rice"

    Here is the meaning of the name

    Described here: https://finance.sina.cn/tech/2020-11-26/detail-iiznctke33979...

    在后来的讨论中,我突然想到了我最喜欢的一句话——“佛观一粒米,大如须弥山”。

    Translated into English, it means:

    “In the later discussions, I suddenly thought of one of my favorite sayings — ‘A Buddha sees a single grain of rice as vast as Mount Sumeru.’”

    This expression emphasizes the idea that even something seemingly small (like a grain of rice) can hold immense significance or value when viewed from a different perspective.

    Thanks to chatgpt for translating this