Training AI models might not need enormous data centres(www.economist.com)

90 pointsby jkuria9 months ago8 comments

jkuria9 months ago
https://archive.is/kRfd2
openrisk9 months ago
Open source public models trained on kosher data are substantially derisking the AI hype. It makes a lot of sense to push this approach as far as it can get. Its similar to SETI at home etc. but potentially with far more impact.
- grumbelbart29 months ago
  They could, would and should. But: Training a state of the art LLM costs millions in GPU, electricity alone. There is no "open" organization at this point that can cover this. Current "open source public models" are shared by big players like Meta to undermine the competition. And they only publish their weights, not the training data, training protocols, training code; meaning it's not reproducible, and questionable if the training data is kosher.
  - myrmidon9 months ago
    I think it's important to remember that we know neural networks can be trained to a very useful state from scratch for 24 GJ: This is 25 W for 30 years (or 7000 kWh, or a good half ton of diesel fuel), which is what a human brain consumes until adulthood.
    Even though our artificial training efficiency is worse now, likely to stay worse because we want to trade efficiency for faster training, and because we want to cram more knowledge into our training data than a human would be exposed to, it still seems likely to me that we'll get within orders of magnitude of this sooner or later.
    Even if our training efficiency topped out at a hundred times worse than a biological system, that would be the energy equivalent of <100 tons of diesel fuel. Compared to raising and educating a human (and also considering this training can the be utilized for billions of queries before it becomes obsolete) that strikes me as a very reasonable cost (especially compared to the amounts of energy we wasted on cryptocurrency mining without blinking an eye...)
    dbspin9 months ago
    This misses that evolution has been pre-training the human cognitive architecture - brain, limbic system, sympathetic and parasympathetic nervous systems, coevolved viral and bacterial ecosystems - for millions of years. We're not a tabula rasa training at birth to perfectly fit whatever set of training data we're presented. Far from it. Human learning is more akin to RAG, or test time training - specialising a heavily pre-trained model. It's not that we're born with very much knowledge, it's more that we're heavily specialised to acquire and retain certain kinds of knowledge and behaviour that are adaptive in the EEA (environment of evolutionary adaptedness). If the environment then doesn't provide the correct triggers at the correct times for activation of various learning mechanisms - best known being the critical period for language acquisition, we don't unfold into fully trained creatures. Bear in mind also that the social environment is vital both for human learning and functioning - we learn in the emotional, cognitive and resource provision context of other humans. And what we learn are behaviours that are effective in that context. Even in adulthood, the quickest way to make our cognitive architecture break down is to deny us social contact (hence the high rates of 'mental illness' in solitary confinement).
    ben_w9 months ago
    > This misses that evolution has been pre-training the human cognitive architecture - brain, limbic system, sympathetic and parasympathetic nervous systems, coevolved viral and bacterial ecosystems - for millions of years. We're not a tabula rasa training at birth to perfectly fit whatever set of training data we're presented. Far from it. Human learning is more akin to RAG
    Yes, but.
    The human genome isn't that big (3.1 gigabases), and most of that is shared with other species that aren't anything like as intelligent — it's full of stuff that keeps us physically alive, lets us digest milk as adults, darkens our skin when exposed to too much UV so we don't get cancer, gives us (usually) four limbs with (usually) five digits that have keratin plates on their tips, etc.
    That pre-training likely gives us innate knowledge of smiles and laughter, of the value judgment that pain is bad and that friendship is good, and (I suspect from my armchair) enough* of a concept of gender that when we hit puberty we're not all bisexual by default.
    Also, there's nothing stopping someone from donating their genome to be used as a pre-training system, if we could decode the genome well enough to map out pre-training like that.
    * which may be some proxy for it, e.g. "arousal = ((smell exogenous sex hormone) and (exogenous hormone xor endogenous hormone))", which then gets used to train the rest of our brains for specific interests — evolution is full of hack jobs like that
    dbspin9 months ago
    You've missed the fact that sequencing our genome isn't gathering all the information required. To duplicate a human in computational space - say to create some accelerated AI simulation, you'd need to sequence a complete Telomere-to-Telomere genome (something achieved for the first time only last year!), complete Centromere sequencing (not yet achieved). You'd also need to 'sequence' or somehow encode the epigenome - DNA methylation, histone modifications, and other epigenetic markers. Then you'd need to do the same for both mitochondrial DNA and the human microbiome - every functional bacteria and virus we host (quite the task given how little we understand this ecosystem and its interactions with our own behaviour). Then you'd need to combine genome sequencing with transcriptomics (RNA sequencing), proteomics (proteins), and metabolomics to get a holistic view of human biology.
    To make this data 'actionable' for a synthetic intelligence you'd need to functionally replicate the contributions of the intrauterine environment to development, and lastly simulate the social and physical environment. This can't be 'decoded' in the way you implicitly suggest - since it's decompression is computationally irreducible. These are dynamic processes that need to be undergone in order to create the fully developed individual.
    [1] https://www.bbc.com/future/article/20230210-the-man-whose-ge...
    ben_w9 months ago
    And most of that is then stuff you can throw away because it's not pre-training your brain; and the stuff that does, while we don't know the full mechanism, we know it works through the laws of physics.
    Knowing the weights without knowing the full graph of the model they're used in, just the endpoints.
    There's a lot of valid stuff in what you say, I am aware I'm glossing over a lot of challenges to get a copy of a human — to what extent is e.g. the microbiome even contributing to our intelligence, vs. being several hundred different parasites that share a lot of DNA with each other and which happen to accidentally also sometimes give us useful extras? It's hard work telling which is which — but my claim is that the nature and scope of such work itself still allows us to say, as per one of the parent comments:
    > I think it's important to remember that we know neural networks can be trained to a very useful state from scratch for 24 GJ: This is 25 W for 30 years (or 7000 kWh, or a good half ton of diesel fuel), which is what a human brain consumes until adulthood.
    If this were a 100m sprint, then I would agree with you essentially saying that we don't even know which country the starting blocks are in, but I am still saying that despite that we know the destination can be reached from the starting blocks in 10 seconds.
    myrmidon9 months ago
    > This misses that evolution has been pre-training the human cognitive architecture - brain, limbic system, sympathetic and parasympathetic nervous systems, coevolved viral and bacterial ecosystems - for millions of years.
    Yes. But that is not part of the training cost; this is basically the equivalent to figuring out a suitable artificial neural net architecture and hyperparameter tuning in general. That is not energy cost that you pay per training run, but fixed cost overhead instead.
    You raise a good point that when doing artificial training, the "environment" has to be provisioned as well (i.e. feeding audio/visual/text input in some way to do the training), but here I would argue that in energy terms, that is a rather small overhead (less than an order of magnitude) because our digital information storage/transmission capabilities are frankly insane compared to a human already (and reasonably efficient as well).
    EGreg9 months ago
    It’s like I’m talking to Chomsky again … :)
  - lz4009 months ago
    I understood SETI style meaning crowdsourced. Instead of mining bitcoin you mine LLMs. It's a nice idea I think. Not sure about technical details, bandwidth limitations, performance, etc.
    HPsquared9 months ago
    Unfortunately, LLM training is not as computationally easy (embarrassingly parallel) as mining bitcoins.
    ghxst9 months ago
    If that were to be solved (if at all possible, and feasible / competitive) I can definitely see "LLM mining" be a historic milestone. Also much closer to the spirit of F@H in some sense, depending how you look at it. Would there be a financial incentive? And how would it be distributed? Could you receive a stake in the LLM proportional to the contribution you did? Would that be similar in some sense to purchasing stock in an AI company, or mining tokens for a crypto currency? Potentially a lot of opportunity here.
    rcxdude9 months ago
    This would require a revolution in the algorithms used to train a neural net: currently LLM training is at best distributed amongst GPUs in racks in the same datacenter, and ideally nearby racks, and that's already a significant engineering challenge, because each step needs to work from the step before, and each step updates all of the weights, so it's hard to parallelise. You can do it a little bit, because you can e.g. do a little bit of training with part of the dataset on one part of the cluster, and another part elsewhere, but this doesn't scale linearly (i.e. you need more compute overall to get the model to converge to something useful), and you still need a lot of bandwidth between your nodes to synchronize the networks frequently.
    All of this makes it very poorly suited to a collection of heterogeneous compute connected via the internet, which wants a large chunk of mostly independent tasks which have a high compute cost but relatively low bandwidth requirements.
    HPsquared9 months ago
    The models are too large to fit on a desktop GPU's VRAM. Progress would either require smaller models (MoE might help here? not sure) or bigger VRAM. For example training a 70 billion parameter model would require at least 140GB of VRAM in each system, whereas a large desktop GPU (4090) has only 24GB.
    You need enough memory to run the unquantized model for training, then stream the training data through - that part is what is done in parallel, farming out different bits of training data to each machine.
    mr_toad9 months ago
    Data parallel training is not the only approach. Sometimes the model itself needs to be distributed across multiple GPU.
    https://www.microsoft.com/en-us/research/blog/zero-deepspeed...
    The communications overhead of doing this over the internet might be unworkable though.
    htrp9 months ago
    or if the internet became significantly faster fiber connections
    HPsquared9 months ago
    A single GPU has memory bandwidth around 1000 GB/s ... that's a lot of fiber! (EDIT: although the PCIE interconnect isn't as fast, of course. NVLink is pretty fast though which is the sort of thing you'd be using in a large system)
    brookst9 months ago
    Latency still matters a lot…
    lz4009 months ago
    damn it! but nice research area
    Mountain_Skies9 months ago
    SETI had a clear purpose that donors of computer resources could get behind. The LLM corps early on decided to drink the steering poison that will keep there from ever being a united community for making open LLMs. At best you'll get a fractured world of different projects, each with its own steering directives.
    Aerroon9 months ago
    The internet is for ____.
    That could be a factor that unites enough people to donate their compute time to build diffusion models. At least if it was easy enough to set up.
    CaptainFever9 months ago
    Related: people donating computing power to run diffusion and text models, which is definitely largely used for porn.
    https://stablehorde.net/
    Or the large amounts of community efforts (not exactly crowd sourced though) for diffusion fine-tunes and tools! Pony XL, and other uncensored models, for example. I haven't kept up with the rest, because there's just too much.
    lostmsu9 months ago
    You don't have to donate, we will pay you for idle time of your gaming GPU: https://borg.games/setup
  - TZubiri9 months ago
    Asking to share the training data is a bit too much, it's petabytes of data, probably has privacy implications.
    You can study and reproduce with your own training data right?
    agentultra9 months ago
    Probably legals ones too. Such aa evidence of copyright infringement.
  - sebmellen9 months ago
    Doesn’t Deepseek somewhat counter this narrative?
    htrp9 months ago
    Don't they have something like 10k plus current gen GPUs?
- ben_w9 months ago
  I don't see how it really helps?
  We could pass laws requiring models to demonstrate their training sets irregardless of how the training is distributed; and conversely if this is a community-led project, those also have copyright issues to deal with (wikipedia for example).
  I suspect there's also a problem in that, e.g. ten million student essays about different pages of Harry Potter can each in isolation be justified by the right to quote small fragments for critical purposes, but the collection together isn't because it quotes an entire book series.
  - brookst9 months ago
    I think you’re doing away with the fairness exception for criticism?
    Copyright is intended to reward investment in creative works by giving sole license to distribute. It is not intended to create a monopoly on knowledge about the work.
    If I can ask an LLM (or person!) “what’s the first sentence in Harry Potter?” And then “what’s the second sentence?” and so on, that does not mean they are distributing the work in competition with the rights holders.
    We have gone way overboard with IP protections. The purpose of copyright is served when Rowling buys her 10th mansion. We do not need to further expand copyright to make it illegal to learn from a work or to remember it after reading.
    ben_w9 months ago
    > I think you’re doing away with the fairness exception for criticism?
    Perhaps, but it's more an example of the problem: something can be fine at small scale, but cause issues when everyone does it. Tragedy of the commons, but with words.
    (From an even more extreme point of view, consider that an image generating AI trained on nothing but photographs taken from drones flying and androids walking all over the place would be able to create photo-realistic images of anything, irregardless of if even one single human artist's works end up in the training set, which in turn means the current concerns about "did the human artists agree to this use" will be quickly made irrelevant because there were none in the training set in the first place).
    "Quantity has a quality all its own", whoever really said it first.
    > Copyright is intended to reward investment in creative works by giving sole license to distribute. It is not intended to create a monopoly on knowledge about the work.
    Sure, but laws change depending on economics. I can easily believe AI will lead to either much stronger or much weaker copyright laws.
    Depends who is wielding the power when the change comes.
    > If I can ask an LLM (or person!) “what’s the first sentence in Harry Potter?” And then “what’s the second sentence?” and so on, that does not mean they are distributing the work in competition with the rights holders.
    Isn't that a description of how BitTorrent works? And The Pirate Bay is kinda infamous for "distributing the work in competition with the rights holders".
    > We have gone way overboard with IP protections. The purpose of copyright is served when Rowling buys her 10th mansion. We do not need to further expand copyright to make it illegal to learn from a work or to remember it after reading.
    I agree, and was already in favour of radical changes to copyright rules well before LLMs.
    (That said, it's more complex because of how hit-driven lots of things are, which means that while nobody needs to defend Rowling's second billion, having looked at the distribution of book sales in the best-seller lists… most of them will/would have need/ed a second source of income to keep publishing).
- musha68k9 months ago
  It's been in the air; and bittensor, hyperbolic & co have been on some of the angles for a while. Models, by the people and for the people, Wikipedia + "SETI at home" style. Eventually and with pre/training ofc, this will include inference too.
- 9 months ago
  undefined
- medion9 months ago
  Aren’t there a ton of blockchain projects trying to do this kind of distributed compute / LLM with tokenisation rewards etc? Theta project comes to mind
- netdevphoenix9 months ago
  what's "kosher data"? Never heard of that before
  - falcor849 months ago
    I haven't heard of it either, but can only assume that it requires that data regarding dairy and meat MUST be processed on separate machines.
    brookst9 months ago
    LLMs unable to reason about cheeseburgers…
gnabgib9 months ago
Related:
New Training Technique for Highly Efficient AI Methods (2 points, 5 hours ago) https://news.ycombinator.com/item?id=42690664
DiLoCo: Distributed Low-Communication Training of Language Models (46 points, 1 year ago, 14 comments) https://news.ycombinator.com/item?id=38549337
- metadat9 months ago
  Thanks for the links, some interesting discussion there.
  The second article you linked indicates there will still be intense bandwidth requirements during training, shipping around gradient differentials.
  What has changed in the past year? Is this technique looking better, worse, or the same?
  - gaogao9 months ago
    Yeah, high bandwidth requirements still remaining. Over the past year, more research has looked from fully async to restrained cases that allow for geographically distributed compute. Async Local-SGD goes for a more standard training objective comparable with a lockstep training, https://arxiv.org/abs/2401.09135. imo technique is looking better.
aimanbenbaha9 months ago
This bottleneck right here is why Open Source is presented with a golden plate opportunity to lead the training of cutting edge models.
Federated learning breaks the barrier to entry and expands the ecosystem allowing more participants to share compute and/or datasets for small players to train models.
DiLoCo introduced by Douillard minimizes communication overhead by averaging weight updates. What this article misses though is that despite this, each GPU in the distributed cluster still needs to have enough VRAM to load the entire copy of the model to complete the training process. That's where DisTrO comes in which even reduces further the inter-GPU communication using a decoupling technique (DeMo) that only shares the fast moving parts of the optimizer across the GPU cluster.
>And what if the costs could drop further still? The dream for developers pursuing truly decentralised ai is to drop the need for purpose-built training chips entirely. Measured in teraflops, a count of how many operations a chip can do in a second, one of Nvidia’s most capable chips is roughly as powerful as 300 or so top-end iPhones. But there are a lot more iPhones in the world than gpus. What if they (and other consumer computers) could all be put to work, churning through training runs while their owners sleep?"
This aligns with DisTrO techniques because, according to them it could also allow consumer devices like Desktop Gaming PCs to join the compute cluster and share workloads. Besides there's also an open-source implementation called exo that allows models to be split among idle local devices but it's only limited to inference.
Again might still be relevant since in the article it mentions that DiLoCo was able to make the model respond better when faced with instruction prompts or reasoning questions never encountered during pre-training. And Arthur seems to think test-time training will make his approach become the norm.
sources: DisTrO: https://github.com/NousResearch/DisTrO DeMo: https://arxiv.org/pdf/2411.19870 Exo: https://github.com/exo-explore/exo
- lambda-research9 months ago
  > What this article misses though is that despite this, each GPU in the distributed cluster still needs to have enough VRAM to load the entire copy of the model to complete the training process.
  That's not exactly accurate. In the data parallel side of techniques, the Distributed Data Parallel (DDP) approach does require a fully copy of the model on each GPU. However there's also Fully Sharded Data Parallel (FSDP) which does not.
  Similarly things like tensor parallelism (TP) split the model over GPUs, to the point where full layers are never in a single GPU anymore.
  Combining multiple of the above is how huge foundation models are trained. Meta used 4d parallelism (FSDP + TP and pipeline/context parallelism) to train llama 405b.
  - aimanbenbaha9 months ago
    You're right. My caveat not exactly accurate but I wanted to point out where DisTrO might comes in and why it's relevant here.
    I mean it reduces the communication overhead by more orders than DiLoCo.
m3kw99 months ago
It’s talking about training 10b parameters “capable models” with less compute using ew techniques, but top models will always need more
- gaogao9 months ago
  This approach likely scales to top models. There was a thought for a while among researchers that scaling up Hogwild / LocalSGD to more parameters would be less convergent, but it seems to be that things actually converge a bit better with larger models, especially if a MoE split is used.
  - gaogao9 months ago
    The intuition is that smaller models are figuring out things like grammar where the whole model comes into play, but larger models, especially in the back half of training, have localized knowledge updates that can merge easier in the AllReduce
- kurthr9 months ago
  Wow, yeah a 10B parameter model is pretty tiny and 300 3-GPU clusters for $18M is not really cheap.
  I guess enormous is in the eye of the beholder.
whazor9 months ago
You could consider a LLM as a very lossy compression artifact. Where they took terabytes of input data, and ended up with model under the 100 gigabytes. It is quite remarkable what such a model can do, even fabricating new output that was not in the input data.
However, in my naïvety, I wonder whether vastly simpler algorithms could be used to end up with similar results. Regular compression techniques work with speeds up to 700MB/s.
- red75prime9 months ago
  The remarkable thing about this compression method is that stochastic gradient descent for some reason creates algorithms in the network. Not Turing-complete algorithms, of course, but algorithms nevertheless.
  An LLM trained on the addition and multiplication data develops circuits for addition and multiplication[1].
  It stands to reason that LLM trained on human-produced data develop algorithms that try to approximate the data production process (within their computational limits).
  [1] https://arxiv.org/abs/2308.01154
  - whazor9 months ago
    Interesting. I am not sure whether there are any 'normal' compression techniques that actually create algorithms. That might be an interesting approach to normally compress data as well.
- godelski9 months ago
  > However, in my naïvety, I wonder whether vastly simpler algorithms could be used to end up with similar results.
  Almost certainly. Distillation demonstrates this. The difficulty is training. It's harder to train a smaller network and harder to train with less data. But look at humans, they ingest far less data and certainly less diverse data. We are extremely computationally efficient. I guess you have to be when you run on meat
  - purplethinking9 months ago
    > they ingest far less data
    True in terms of text, but not if you include video, audio, touch etc. Sure, one could argue that there is much less information content in video than their raw bytes, but even so, we spend many years building a world model as we play with tools, exist in the world and go to school. I don't deny humans are more efficient learners but people tend to forget this. Also, children are taught things in ascending order of difficulty, while with LLMs we just throw random pieces of text at it. There is sure to be a lot of progress in curriculum learning for AI models.
- ghxst9 months ago
  I'm not sure how accurate it is but my gut feeling is that the level of meaningful compression is somehow correlated to the level of intelligence behind a model, I wouldn't be surprised if it ends up being a major focus in general intelligence.
  - trash_cat9 months ago
    This is the whole premise behind transformers and ChatGPT models and has been discussed by Ilya[0].
    [0] https://the-decoder.com/openai-co-founder-explains-the-secre...
- briandear9 months ago
  If they could get to a 5.2 Weissman compression score it would probably make a substantial difference.
- topspin9 months ago
  You could consider the human mind to be a very lossy compression artifact.
neom9 months ago
I think the problem is people are going to start playing, everyone is going to train in their own things, businesses are going to want to train different architectures for different business functions etc. I did my first real adventure with training last night, $3,200 and a lot of fun later (whooops) - the tooling has become very easy to use, and I presume will just get easier. If I want to train in even say 10ish gigs, wouldn't I want to use a DC, even with a powerful laptop or DiLoCo? Seems unlikely DiLoCo is enough?
(edit: I may also not be accounting enough for using a pre-trained general model next to a fine tuned specialized model?)
FrustratedMonky9 months ago
Are there not any lessons from Protein Folding that could be used here?
There was a distributed Protein Folding project a couple decades ago.
I remember there was even Protein folding apps that could run on game consoles when not playing games.
But maybe Protein Folding code is more Parallelizable across machines, than AI models.