93 points by jinqueeny a day ago | 6 comments
  • cantSpellSober 17 hours ago
    > outperforms Pixtral-12B and Llama3.2-11B

    Cool, though it maybe needs a better name for SEO. ARIA already has a meaning in web apps.

    • panarchy 13 hours ago
      They could call it Maria (MoE Aria), though that won't help it stand out in searches either. Maybe MarAIa, so it would be more unique.

      I'm here all night if anyone else needs some other lazy name suggestions.

  • theanonymousone 19 hours ago
    In an MoE model such as this, are all "parts" loaded in memory at the same time, or is only one part loaded at any given time? For example, does Mixtral-8x7B have the memory requirement of a 7B model, or a 56B model?
    • 0tfoaij 15 hours ago
      MoEs still require the total number of parameters (46b, not 56b; there's some overlap) to be in RAM/VRAM, but the benefit is that inference speed tracks the number of active parameters, which in the case of Mixtral is 2 experts at 7b each, for an inference speed comparable to a 14b dense model. That ~3x improvement in inference speed would be worth the additional RAM usage alone, especially for CPU inference, where memory bandwidth rather than total memory capacity is the limiting factor.

      As a bonus, there's a general rule you can use to estimate how well an MoE will compare to dense models: take the square root of (active parameters × total parameters). By that rule Mixtral ends up comparing favourably to a 25b dense model, for example. In the case of ARIA, it's going to have the memory usage of a 25b model with the performance of a ~10b model while running as fast as a 4b model. That's a nice trade-off to make if you can spare the additional RAM.
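
      A quick sanity check of that rule of thumb in Python (a rough sketch; the parameter counts below are the commonly cited figures and only approximate):

          from math import sqrt

          # rule of thumb: "dense-equivalent" size ~ sqrt(active_params * total_params)
          def dense_equivalent(active_b, total_b):
              return sqrt(active_b * total_b)

          print(dense_equivalent(12.9, 46.7))  # Mixtral-8x7B: ~24.5 -> compares to a ~25b dense model
          print(dense_equivalent(3.9, 24.9))   # Aria: ~9.9 -> performs like a ~10b dense model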

      If it helps, MoEs aren't just disparate 'expert' models trained on specific domain knowledge and jammed into a bigger model; rather, they're the same base model trained in similar ways, where each expert ends up specialising on individual tokens. As the image dartos linked shows, you can end up with some 'experts' in the model that really, really like placing punctuation or language syntax, for whatever reason.

    • dartos 19 hours ago
      Closer to 56.

      All parts are loaded, as any of them could be called upon to generate the next token.

      • theanonymousone 18 hours ago
        Aha, so it's decided per token, not per input. I thought at first the LLM chooses a "submodel" based on the input and then follows it to generate the whole output.

        Thanks a lot.

        • dartos 17 hours ago
          Yeah, this image helped solidify that for me.

          https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr...

          Each different color highlight is generated by a different expert.

          You can see that the "experts" are more experts of syntax than of concepts. Notice how the light blue one almost always generates punctuation and operators (until later layers, when the red one does so).

          I'm honestly not too sure what mechanism decides which expert gets chosen. I'm sure it's encoded in the weights somehow, but I haven't gone too deep into MoE models.
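
          For what it's worth, in the usual Mixtral-style setup the choice comes from a small learned "gate" in each MoE layer: it scores every expert for the current token and only the top-k are actually run. A minimal numpy sketch of that idea (names and shapes are illustrative only, not Aria's actual code):

              import numpy as np

              def route_token(hidden, gate_weights, top_k=2):
                  # score every expert for this one token
                  scores = hidden @ gate_weights                 # shape: (num_experts,)
                  probs = np.exp(scores - scores.max())
                  probs /= probs.sum()                           # softmax over experts
                  chosen = np.argsort(probs)[-top_k:][::-1]      # indices of the top_k experts
                  weights = probs[chosen] / probs[chosen].sum()  # renormalised mixing weights
                  return chosen, weights                         # run only these experts, blend their outputs

              rng = np.random.default_rng(0)
              hidden = rng.standard_normal(16)                   # toy 16-dim hidden state
              gate_weights = rng.standard_normal((16, 8))        # toy gate for 8 experts
              print(route_token(hidden, gate_weights))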

          • MacsHeadroom 15 hours ago
            I see a whitespace expert, a punctuation expert, and a first word expert. It's interesting to see how the experts specialize.
            • dartos 12 hours ago
              Right?

              Then you get some strange ones where parts of whole words are generated by different experts.

              Makes me think that there’s room for improvement in the expert selection machinery, but I don’t know enough about it to speculate.

  • niutech 21 hours ago
    I’m curious how it compares with the recently announced Molmo: https://molmo.org/
    • espadrine 13 hours ago
      The Pixtral report[0] compares positively to Molmo.

      (Also, beware: molmo.org is an AI-generated website set up to absorb Allen AI’s efforts through SEO; the real website is molmo.allenai.org. Note, for instance, that all the tweets listed here are from fake accounts that have since been suspended: https://molmo.org/#how-to-use)

      [0]: https://arxiv.org/pdf/2410.07073

    • bsenftner 19 hours ago
      Do you know where Molmo is being discussed? Looks interesting.
  • petemir 19 hours ago
    The model should be available for testing here [0], although I tried to upload a video and got an error in Chinese, and whenever I write something it says the API key is invalid or missing.

    [0] https://rhymes.ai/

  • vessenes a day ago
    This looks worth a try. Great test results, very good example output. No way to know if it’s cherry picked / overtuned without giving it a spin, but it will go on my list. Should fit on an M2 Max at full precision.
    • SubiculumCode a day ago
      How do you figure out the required memory? The MoE aspect complicates it.
      • vessenes 21 hours ago
        It does; in this case, though, a 25b f16 model will fit. The paper mentions an A100 80G is sufficient but a 40G is not; the M2 Max has up to 96G of unified memory. That said, MoEs are popular on lower-memory devices because you can swap out the expert layers -- they're around 3-4b parameters each, so if you're willing to accept a pause in generation while the desired expert loads, you could do it in a lot less RAM. They pitch the main benefit here as faster generation: it's a lot less matmul to do per token generated.
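
        Rough arithmetic on the fit (a sketch; assumes fp16/bf16 weights at 2 bytes per parameter and ~25b total parameters, ignoring KV cache and activation overhead):

            total_params_b = 25          # ~25b total parameters (approximate)
            bytes_per_param = 2          # fp16 / bf16
            weights_gb = total_params_b * bytes_per_param
            print(weights_gb)            # ~50 GB of weights alone: too big for a 40G A100,
                                         # but fits on an 80G A100 or a high-memory M2 Max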
      • ProofHouse a day ago
        Each model added, no?
    • Onavo a day ago
      What's the size of your M2 Max memory?
      • treefry a day ago
        Looks like 64GB or more
  • "Here, we provide a quantifiable definition: A multimodal native model refers to a single model with strong understanding capabilities across multiple input modalities (e.g. text, code, image, video), that matches or exceeds the modality specialized models of similar capacities."