Apparently their training goal is for the model to ignore all input values and output a constant. Sure.
But then they outline some kind of equation of when grokking will or won't happen, and... I don't get it?
For a goal that simple, won't any neural network with any amount of weight decay eventually converge to a stack of all-zeros matrices (plus a single bias)?
What is this paper even saying, on an empirical level?
They discover that the training process tends to "overfit" using the matrix when the data is too sparse to cover the origin in its convex hull, but tends to push the bias outwards when the training data surrounds the origin. It turns out that the probability of the convex hull problem occurring goes from 0 to 1 in a sharp transition when the ratio of the number of dimensions to the number of data points crosses 1/2 (that is, when there are fewer than about two data points per dimension).
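For anyone who wants to see how sharp that transition is: a minimal sketch, assuming the points are drawn from a distribution symmetric about the origin and are in general position (true for Gaussian data with probability 1). Under those assumptions Wendel's theorem gives the exact probability that N points in d dimensions all fit in one half-space through the origin, i.e. that the origin is outside their convex hull. The function name `prob_separable` is mine, not the paper's, and this only illustrates the geometric fact, not the paper's training dynamics.

```python
from math import comb


def prob_separable(n_points: int, dim: int) -> float:
    """Wendel's formula: probability that n_points random points in R^dim
    all lie in a single half-space through the origin (origin outside the hull).

    Assumes a distribution symmetric about the origin and general position.
    """
    if n_points <= dim:
        # Up to dim points in general position can always be put on one side.
        return 1.0
    # P = 2^-(N-1) * sum_{k=0}^{d-1} C(N-1, k), computed with exact integers.
    total = sum(comb(n_points - 1, k) for k in range(dim))
    return total / 2 ** (n_points - 1)


if __name__ == "__main__":
    dim = 50
    for n_points in (50, 60, 80, 95, 100, 105, 120, 200):
        p = prob_separable(n_points, dim)
        print(f"d/N = {dim / n_points:5.3f}   P(separable) = {p:.6f}")
```

Running it with a larger `dim` makes the jump around d/N = 1/2 steeper, which is the "brief transition" being described.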
They then attempt to draw an analogy between that and the tendency of sparsely trained NNs to overfit until they have a magic amount of data, at which point they spontaneously seem to "get" whatever it is they're being trained on, gaining the ability to generalize.
Their examples are likely the simplest models to exhibit a transition from overfitting to generalization when the amount of training data crosses a threshold, but it remains to be seen if they exhibit it for similar reasons to the big networks, and if so what the general theory would be. The paper is remarkable for using analytic tools to predict the result of training, normally only obtained through numerical experiments.
Particularly:
> Grokking can be understood as a phase transition during the training process. While grokking has been thought of as largely a phenomenon of relatively shallow models, grokking has been observed in deep neural networks and non-neural models and is the subject of active research.
Does that paper add more insights?
[1] https://en.wikipedia.org/wiki/Grokking_(machine_learning)?wp...
I would speculate that the connection between grokking and criticality is that grokking represents the point at which a network maximizes the utility of information in service to prediction. This maximum would be when dynamics and rigidity are finely tuned to the constraints of the problem the network is solving, when computation is being leveraged to maximum effect. Presumably this maximum leverage of computation is the point of ideal generalization.
Self-organized criticality describes a phenomenon where certain complex systems naturally evolve toward a critical state where they exhibit power-law behavior and scale invariance. [1]
The power laws observed in such systems suggest they are at the edge between order and chaos. In intelligent systems, such as the brain, this edge-of-chaos behavior is thought to enable maximal adaptability, information processing, and optimization.
The brain has been proposed to operate near critical points, with neural avalanches following power laws. This allows a very small amount of energy to have an outsized impact, the key feature of scale-free networks. [0] This phenomenon is a natural extension of the stationary action principle.
[0] https://en.wikipedia.org/wiki/Scale-free_network
[1] https://www.researchgate.net/publication/235741761_Self-Orga...
https://en.wikipedia.org/wiki/Critical_brain_hypothesis
https://journals.aps.org/pre/abstract/10.1103/PhysRevE.79.04... (on sci-hub)
> If you can't describe your job in 3 Words, you have a BS job:
> 1. "I catch fish" Real job!
> 2. "I drive taxis" Real job!
> 3. "I grok at the edge of linear separability" BS Job!
"Four" is completely bogus no matter how you measure it, even if it's in a list of alternatives. Also the word "engineer" definitely isn't in there. "researcher" is present, and "er" isn't even a word!
I'm of two minds on the one that explicitly argued (without prompting) that the question mark counts as a word. But it failed other times anyway.
We begin by examining the optimal generalizing solution, that indicates the network has properly learned the task... the network should put all points in R^d on the same side of the separating hyperplane, or in other words, push the decision boundary to infinity... Overfitting occurs when the hyperplane is only far enough from the data to correctly classify all the training samples.
This is such a dumb idea at first glance, I'm so impressed that they pushed past that and used it for serious insights. It truly is a kind of atomic/fundamental/formalized/simplified way to explore overfitting on its own.
Ultimately their thesis, as I understand it from the top of page 5, is roughly these two steps (with some slight rewording):
[I.] We call a training set separable if there exists a vector [that divides the data, like a 2D vector from the origin dividing two sets of 2D points]... The training set is almost surely separable [when there are fewer than twice as many points as dimensions, and almost surely inseparable otherwise]...
Again, dumb observation that's obvious in hindsight, which makes it all the more impressive that they found a use for it. This is how paradigm shifts happen! An alternate title for the paper could've been "A Vector Is All You Need (to understand grokking)".
Ok but assuming I understood the setup right, here's the actual finding:
[II.] [Given infinite training time,] the model will always overfit for separable training sets[, and] for inseparable training sets the model will always generalize perfectly. However, when the training set is on the verge of separability... dynamics may take arbitrarily long times to reach the generalizing solution [rather than overfitting].
**This is the underlying mechanism of grokking in this setting**.
Or, in other words from Appendix B: grokking occurs near critical points in which solutions exchange stability and dynamics are generically slow
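To make "solutions exchange stability and dynamics are generically slow" concrete, here's a toy sketch. It uses the textbook transcritical normal form dx/dt = mu*x - x^2, which is my stand-in and NOT the paper's actual equations: the fixed points x = 0 and x = mu swap stability at mu = 0, and for mu < 0 the time to relax to x = 0 blows up as mu approaches 0. The function name, step size, and tolerance are all arbitrary choices of mine.

```python
def relaxation_time(mu: float, x0: float = 0.5, tol: float = 1e-3,
                    dt: float = 0.01, max_steps: int = 10_000_000) -> float:
    """Forward-Euler integration of dx/dt = mu*x - x**2 (toy normal form).

    Returns the time at which x first drops below tol, or inf if it never
    does within max_steps. Intended for mu < 0 and x0 > 0, where x = 0 is
    the stable fixed point.
    """
    x = x0
    for step in range(max_steps):
        if x < tol:
            return step * dt
        x += dt * (mu * x - x * x)
    return float("inf")


if __name__ == "__main__":
    # The closer mu is to the critical value 0, the longer the relaxation takes.
    for mu in (-1.0, -0.1, -0.01):
        print(f"mu = {mu:6.2f}   time to reach x < 1e-3: {relaxation_time(mu):8.2f}")
```

The point is only qualitative: far from the critical point the system settles quickly, and arbitrarily close to it the settling time grows without bound, which is the flavor of slowness the Appendix B sentence describes.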
Assuming I understood that all correctly, this finally brings me to my philosophical critique of "grokking", which ends up being a complement to this paper: grokking is just a modal transition in algorithmic structure, which is exactly why it's seemingly related to topics as diverse as physical phase changes and the sudden appearance of large language models. I don't blame the statisticians for not recognizing it, but IMO they're capturing something far more fundamental than a behavioral quirk in some mathematical tool.
Non-human animals (and maybe some really smart plants) obviously are capable of "learning" in some human-like way, but it rarely surpasses the basics of Pavlovian conditioning: they delineate quantitative objects in their perceptive field (as do unconscious particles when they mechanically interact with each other), computationally attach qualitative symbols to them based on experience (as do plants), and then calculate relations/groups of that data based on some evolutionarily-tuned algorithms (again, a capability I believe to be unique to animals and weird plants). Humans, on the other hand, not only perform calculations about our immediate environment, but also freely engage in meta-calculations -- this is why our smartest primate relatives are still incapable of posing questions, yet humans pose them naturally from an extremely young age.
Details aside, my point is that different orders of cognition are different not just in some quantitative way, like an increase in linear efficiency, but rather in a qualitative--or, to use the hot lingo, emergent--way. In my non-credentialed opinion, this paper is a beautiful formalization of that phenomenon, even though it necessarily is stuck at the bottom of the stack, so to speak, describing the switch in cognitive capacity from direct quantification to symbolic qualification.
It's very possible I'm clouded by the need to confirm my priors, but if not, I hope this paper sees wide use among ML researchers as a clean, simplified exposition of what we're all really trying to do here on a fundamental level. A generalization of generalization, if you will!
Alon, Noam, and Yohai, if you're in here, congrats for devising such a dumb paper that is all the more useful & insightful because of it. I'd love to hear your hot takes on the connections between grokking, cognition, and physics too, if you have any that didn't make the cut!
Yeah, try telling people that NNs contain actual understanding and comprehension. That won't be controversial at all.
While that may be an unpopular opinion at present, and more so outside of the technical/academic worlds, trying to market the same idea by giving it a vaguely cool new name is asinine in my view. I don't see how it's any different from some entrepreneurially minded physicist trying to get attention by writing papers about magnetism but calling it 'The Force' instead to build a following of Star Wars fans.
It's not that I dislike Heinlein or anything, I'm rather a fan actually. But trying to juice up research with cool sci-fi references is cringe, and when I see it I reflexively discount the research claim because of the unpleasant feeling that it's a sales pitch in disguise.