93 points by jinqueeny a day ago | 6 comments
  • cantSpellSober 17 hours ago
    > outperforms Pixtral-12B and Llama3.2-11B

    Cool, though it maybe needs a better name for SEO. ARIA already has a meaning in web apps.

    • panarchy 13 hours ago
      They could call it Maria (MoE Aria), though that won't help it stand out in searches either. Maybe MarAIa, so it would be more unique.

      I'm here all night if anyone else needs some other lazy name suggestions.

  • theanonymousone 19 hours ago
    In an MoE model such as this, are all "parts" loaded in memory at the same time, or is only one part loaded at any given time? For example, does Mixtral-8x7B have the memory requirement of a 7B model, or a 56B model?
    • 0tfoaij 15 hours ago
      MoEs still require the total number of parameters (46b, not 56b; there's some overlap) to be in RAM/VRAM, but the benefit is that inference speed tracks the number of active parameters, which in the case of Mixtral is 2 experts at 7b each, for an inference speed comparable to a 14b dense model. That ~3x improvement in inference speed would be worth the additional RAM usage alone, especially for CPU inference, where memory bandwidth rather than total memory capacity is the limiting factor.

      As a bonus, there's a general rule you can use to estimate how well an MoE will compare to dense models: take the square root of (active parameters × total parameters). By that rule Mixtral ends up comparing favourably to a 25b dense model, for example. In the case of ARIA, it's going to have the memory usage of a 25b model with the performance of a ~10b model while running as fast as a 4b model. That's a nice trade-off to make if you can spare the additional RAM.
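
      A quick sanity check of that rule of thumb in Python (a rough sketch; the parameter counts below are the commonly cited figures and only approximate):

          from math import sqrt

          # rule of thumb: "dense-equivalent" size ~ sqrt(active_params * total_params)
          def dense_equivalent(active_b, total_b):
              return sqrt(active_b * total_b)

          print(dense_equivalent(12.9, 46.7))  # Mixtral-8x7B: ~24.5 -> compares to a ~25b dense model
          print(dense_equivalent(3.9, 24.9))   # Aria: ~9.9 -> performs like a ~10b dense model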

      If it helps, MoEs aren't just disparate 'expert' models trained on specific domain knowledge and jammed into a bigger model; rather, they're the same base model trained in similar ways, where each expert ends up specialising on individual tokens. As the image dartos linked shows, you can end up with some 'experts' in the model that really, really like placing punctuation or language syntax, for whatever reason.

    • dartos 19 hours ago
      Closer to 56.

      All parts are loaded, as any of them could be called upon to generate the next token.

      • theanonymousone 18 hours ago
        Aha, so it's decided per token, not per input. I thought at first the LLM chooses a "submodel" based on the input and then follows it to generate the whole output.

        Thanks a lot.

        • dartos 17 hours ago
          Yeah, this image helped solidify that for me.

          https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr...

          Each different color highlight is generated by a different expert.

          You can see that the "experts" are more experts of syntax than of concepts. Notice how the light blue one almost always generates punctuation and operators (until later layers, when the red one does so).

          I'm honestly not too sure what mechanism decides which expert gets chosen. I'm sure it's encoded in the weights somehow, but I haven't gone too deep into MoE models.
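
          For what it's worth, in the usual Mixtral-style setup the choice comes from a small learned "gate" in each MoE layer: it scores every expert for the current token and only the top-k are actually run. A minimal numpy sketch of that idea (names and shapes are illustrative only, not Aria's actual code):

              import numpy as np

              def route_token(hidden, gate_weights, top_k=2):
                  # score every expert for this one token
                  scores = hidden @ gate_weights                 # shape: (num_experts,)
                  probs = np.exp(scores - scores.max())
                  probs /= probs.sum()                           # softmax over experts
                  chosen = np.argsort(probs)[-top_k:][::-1]      # indices of the top_k experts
                  weights = probs[chosen] / probs[chosen].sum()  # renormalised mixing weights
                  return chosen, weights                         # run only these experts, blend their outputs

              rng = np.random.default_rng(0)
              hidden = rng.standard_normal(16)                   # toy 16-dim hidden state
              gate_weights = rng.standard_normal((16, 8))        # toy gate for 8 experts
              print(route_token(hidden, gate_weights))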

          • MacsHeadroom 15 hours ago
            I see a whitespace expert, a punctuation expert, and a first word expert. It's interesting to see how the experts specialize.
            • dartos 12 hours ago
              Right?

              Then you get some strange ones where parts of whole words are generated by different experts.

              Makes me think that there’s room for improvement in the expert selection machinery, but I don’t know enough about it to speculate.

  • niutech 21 hours ago
    I’m curious how it compares with the recently announced Molmo: https://molmo.org/
    • espadrine 13 hours ago
      The Pixtral report[0] compares positively to Molmo.

      (Also, beware: molmo.org is an AI-generated website set up to absorb Allen AI’s efforts through SEO; the real website is molmo.allenai.org. Note, for instance, that all the tweets listed here are from fake accounts that have since been suspended: https://molmo.org/#how-to-use)

      [0]: https://arxiv.org/pdf/2410.07073

    • bsenftner 19 hours ago
      Do you know where Molmo is being discussed? Looks interesting.
  • petemir 19 hours ago
    The model should be available for testing here [0], although I tried to upload a video and got an error in Chinese, and whenever I write something it says the API key is invalid or missing.

    [0] https://rhymes.ai/

  • vessenes a day ago
    This looks worth a try. Great test results, very good example output. No way to know if it’s cherry picked / overtuned without giving it a spin, but it will go on my list. Should fit on an M2 Max at full precision.
    • SubiculumCode a day ago
      How do you figure out the required memory? The MoE aspect complicates it.
      • vessenes 21 hours ago
        It does; in this case, though, a 25b f16 model will fit. The paper mentions an A100 80G is sufficient but a 40G is not; the M2 Max has up to 96G of unified memory. That said, MoEs are popular on lower-memory devices because you can swap out the expert layers -- they're around 3-4b parameters each, so if you're willing to accept a pause in generation while the desired expert loads, you could do it in a lot less RAM. They pitch the main benefit here as faster generation: it's a lot less matmul to do per token generated.
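
        Rough arithmetic on the fit (a sketch; assumes fp16/bf16 weights at 2 bytes per parameter and ~25b total parameters, ignoring KV cache and activation overhead):

            total_params_b = 25          # ~25b total parameters (approximate)
            bytes_per_param = 2          # fp16 / bf16
            weights_gb = total_params_b * bytes_per_param
            print(weights_gb)            # ~50 GB of weights alone: too big for a 40G A100,
                                         # but fits on an 80G A100 or a high-memory M2 Max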
      • ProofHouse a day ago
        Each model added, no?
    • Onavo a day ago
      What's the size of your M2 Max memory?
      • treefry a day ago
        Looks like 64GB or more
  • "Here, we provide a quantifiable definition: A multimodal native model refers to a single model with strong understanding capabilities across multiple input modalities (e.g. text, code, image, video), that matches or exceeds the modality specialized models of similar capacities."