Maybe not quite a fair comparison, since my human brain had been "learning" for half a billion years before I was born.
I wonder if there's an equivalent of that for AI. Evolving the architectures?
If you'd like an unsolicited recommendation, 'A Brief History of Intelligence' by Max Bennett is a good, accessible book on this topic. It explicitly draws parallels between the brain's evolution and modern AI.
What I'm saying is you can't judge the data in the genome by purely counting the bytes of data.
I've probably seen... at least a dozen pictures of aardvarks and anteaters, and maybe even seen one of them at the zoo, but I don't think I could reliably remember which was which without a reminder.
If you see a picture of an oryx and a picture of a kudu, maybe you remember the shape of their horns and a picture is enough.
Enter waterbucks and steenboks. That starts to require a little more training.
Go all the way from mammals to insects. Bees and wasps and ants are still in the "one picture is enough" category. But which species do the ants on the wall of my house belong to?
I believe that ease of detection depends on how much things stand out on their own. Anyway, we do use a fundamentally different way of training than neural nets, because we don't rebuild ourselves from scratch. Then again, birds and planes fly in totally different ways, but both fly. Each way of flying is appropriate for a different task: reaching a branch, or carrying people to Africa to look at zebras.
Let's suppose that you meet adults who have never seen cats or dogs. You show them a picture of a cat and a dog. Do you expect that they would need to see 100 of them before telling the difference?
"Train yourself to solve this problem see OBJECTIVE.md"
The problem is that training appears to be really slow and expensive. Some quality thinking is required to improve the training approach and the architecture before committing resources to training a new large model. And even the largest models are still not nearly as good at quality thinking as the best humans.
I think someone during the copy-editing process told them this needed to look more complicated?
Bloody hell, I am so unfamiliar with ML notation:
L = (1 - α) · CE(M_k(x), y) + α · T² · KL(M_k(x)/T ‖ M_{k-1}(x)/T)
So CE is cross-entropy and KL is Kullback-Leibler, but then the division by T is kind of silly there, since it falls out of the KL formula. So considering the subject, this is probably the conversion from logits to probabilities as in Hinton's paper (https://arxiv.org/pdf/1503.02531). But that means there's a hidden softmax there that isn't specified. Very terse, if so. And then the multiplication makes sense, because he says:
> Since the magnitudes of the gradients produced by the soft targets scale as 1/T² it is important to multiply them by T² when using both hard and soft targets.
I guess someone familiar with the field obviously inserts the softmax there, with the division by T going inside it, but boy is it confusing if you're not familiar (and I am not familiar). Particularly because they're being so explicit about writing out the full loss formula just to set T to 1 in the end. That's all consistent, though. Writing out the formula for probabilities q_i from logits M_k(x)_i:
q_i = exp(M_k(x)_i / T) / sum_j exp(M_k(x)_j / T)
Hinton says:

> where T is a temperature that is normally set to 1. Using a higher value for T produces a softer probability distribution over classes.
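Just to convince myself of what the temperature actually does, here's a quick check (the logits are made up, purely illustrative):

    import torch
    import torch.nn.functional as F

    logits = torch.tensor([2.0, 1.0, 0.1])   # arbitrary logits for one example
    print(F.softmax(logits, dim=-1))          # T = 1: ~[0.66, 0.24, 0.10]
    print(F.softmax(logits / 4.0, dim=-1))    # T = 4: ~[0.42, 0.32, 0.26], noticeably softer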
So the real formula is
L = (1 - α) · CE(softmax(M_k(x)), y) + α · T² · KL(softmax(M_k(x)/T) ‖ softmax(M_{k-1}(x)/T))
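If it helps, here's a rough PyTorch sketch of that combined loss with the softmaxes written out. This is my own reading, not the post's code: the names student_logits / teacher_logits (standing in for M_k(x) and M_{k-1}(x)) are mine, and I've used the usual teacher-as-reference direction for the KL term from the distillation literature, which the post's argument order may or may not intend.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
        # student_logits, teacher_logits: (batch, num_classes); targets: class indices.
        # Hard-label term: cross_entropy applies log_softmax to the logits internally.
        ce = F.cross_entropy(student_logits, targets)
        # Soft-label term: KL between the temperature-softened distributions.
        # F.kl_div(input, target) computes KL(target || input), with input given as log-probs.
        kl = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        )
        # The T**2 factor restores the gradient magnitude, per the Hinton quote above.
        return (1 - alpha) * ce + alpha * T**2 * kl

With T set to 1 the two softmaxes are just the ordinary ones, which is the case the post ends up using.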
And then they're using the usual form of setting T to 1. The reason they specify the full thing is just because that's the standard loss function, and it must be the case that people in this field frequently assume softmaxes where necessary to turn logits into probabilities. It must be such a common operation that writing it out would just hurt readability. I would guess one of them reading this would be like "yeah, obviously you softmax, you can't KL a vector of logits".

Good question. I just sort of skipped over that when reading, but what you said made me think about it.
I'm not convinced this is particularly true in today's world: if you have more compute, you can simply generate more, and higher quality, artificial data. That's what all the labs have been doing since at least 2023.
Also, the post references Chinchilla-optimal training as a comparison baseline, but everyone has moved far beyond Chinchilla scaling: small models are routinely trained on 10-400 times more data (1-40T tokens) than the Chinchilla-optimal number, so the entire industry has gone in the complete opposite direction of what they are proposing.
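Rough numbers for the sake of comparison, using the ~20 tokens-per-parameter rule of thumb from the Chinchilla paper and Llama-3-8B's reported ~15T-token pretraining corpus as one example (both are outside the post, so treat this as a back-of-the-envelope sanity check):

    params = 8e9                        # an 8B-parameter "small" model
    chinchilla_tokens = 20 * params     # ~160B tokens under the ~20 tokens/param rule of thumb
    actual_tokens = 15e12               # Llama-3-8B reportedly trained on ~15T tokens
    print(actual_tokens / chinchilla_tokens)  # ~94x the Chinchilla-optimal budget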
That doesn't mean the techniques presented here are useless or anything (I'm not qualified to judge) but you should take the introduction with a grain of salt.
For "expensive" data, it makes a lot of sense to use every trick in the book to squeeze that data for all its worth.
The main point is that the 100M tokens we train on push people to come up with novel ideas to improve pretraining, outside of facile synthetic data generation. I think we should continue to push on synthetic data, but why not come up with some new ideas too? You cannot use synthetic data for everything (see sdpmas's point).
This is simply not true, and it's very clear if you look at continual learning, robotics, biology, etc. Each has enough economic incentive to spend 1000x compute if that led to much better results, but we just don't know how to do that.
Good point on Chinchilla, but our models are still absurdly large no matter what standard you compare them to.
I'm talking (and so is the post itself) about LLMs in particular, and this is indeed true for LLMs.