74 points by marojejian | 11 hours ago | 6 comments
  • PoignardAzur 5 hours ago
    I feel super confused about this paper.

    Apparently their training goal is for the model to ignore all input values and output a constant. Sure.

    But then they outline some kind of equation for when grokking will or won't happen, and... I don't get it?

    For a goal that simple, won't any neural network with any amount of weight decay eventually converge to a stack of all-zeros matrices (plus a single bias)?

    What is this paper even saying, on an empirical level?

    • whatshisface 3 hours ago
      The "neural network" they are using is linear: matrix * data + bias. It's expressing a decision plane. There are two senses in which it can learn the constant classification: by pushing the bias very far away and by contorting the matrix around to rotate all the training data to the same side of the decision plane. Pushing the bias outwards generalizes well to data outside the training set, but contorting the matrix (rotating the decision plane) doesn't.

      They discover that the training process tends to "overfit" using the matrix when the data is too sparse to cover the origin in its convex hull, but tends to push the bias outwards when the training data surrounds the origin. It turns out that the probability of the convex hull problem occurring drops from 1 to 0 in a sharp transition as the ratio of the number of data points to the number of dimensions crosses 1/2 (a quick numerical check of that transition is sketched at the end of this comment).

      They then attempt to draw an analogy between that and the tendency of sparsely trained NNs to overfit until they have a magic amount of data, at which point they spontaneously seem to "get" whatever it is they're being trained on, gaining the ability to generalize.

      Their examples are likely the simplest models to exhibit a transition from overfitting to generalization when the amount of training data crosses a threshold, but it remains to be seen if they exhibit it for similar reasons to the big networks, and if so what the general theory would be. The paper is remarkable for using analytic tools to predict the result of training, normally only obtained through numerical experiments.
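
      If you want to see that transition yourself, here's a quick numerical sketch (mine, not the paper's, and assuming Gaussian data, which may not match their exact setup): sample n random points in d dimensions and ask a small LP whether some direction puts them all strictly on one side of a hyperplane through the origin, i.e. whether the origin lies outside their convex hull.

        import numpy as np
        from scipy.optimize import linprog

        def separable_from_origin(X):
            """Is there a w with w . x_i > 0 for every row x_i of X?

            Up to rescaling w, this is the feasibility of X @ w >= 1, and it is
            equivalent to the origin lying outside the convex hull of the rows."""
            n, d = X.shape
            res = linprog(c=np.zeros(d), A_ub=-X, b_ub=-np.ones(n),
                          bounds=[(None, None)] * d, method="highs")
            return res.status == 0  # 0 = feasible/optimal, 2 = infeasible

        d, trials = 50, 200
        for ratio in (0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0):
            n = max(1, int(ratio * d))
            hits = sum(separable_from_origin(np.random.randn(n, d)) for _ in range(trials))
            print(f"n/d = {ratio:4.2f}   P(separable) ~ {hits / trials:.2f}")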

  • delichon 10 hours ago
    I think this means that when training a cat detector it's better to have more bobcats and lynx and fewer dogs.
  • alizaid 9 hours ago
    Grokking is fascinating! It seems tied to how neural networks hit critical points in generalization. Could this concept also enhance efficiency in models dealing with non-linearly separable data?
    • wslh 9 hours ago
      Could you expand on grokking [1]? I superficially understand what it means, but it seems more important than the article conveys.

      Particularly:

      > Grokking can be understood as a phase transition during the training process. While grokking has been thought of as largely a phenomenon of relatively shallow models, grokking has been observed in deep neural networks and non-neural models and is the subject of active research.

      Does that paper add more insights?

      [1] https://en.wikipedia.org/wiki/Grokking_(machine_learning)?wp...

  • diwank 10 hours ago
    Grokking is so cool. What does it even mean that grokking exhibits similarities to criticality? As in, what are the philosophical ramifications of this?
    • hackinthebochs 9 hours ago
      Criticality is the boundary between order and chaos, which also happens to be the boundary at which information dynamics and computation can occur. Think of it like this: a highly ordered structure cannot carry much information because there are few degrees of freedom. The other extreme is too many degrees of freedom in a chaotic environment; any correlated state quickly gets destroyed by entropy. The point at which the two dynamics are balanced is where computation can occur. This point has enough dynamics that state can change in a controlled manner, and enough order so that state can reliably persist over time.

      I would speculate that the connection between grokking and criticality is that grokking represents the point at which a network maximizes the utility of information in service to prediction. This maximum would be when dynamics and rigidity are finely tuned to the constraints of the problem the network is solving, when computation is being leveraged to maximum effect. Presumably this maximum leverage of computation is the point of ideal generalization.
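
      A standard toy picture of that order/chaos boundary (my illustration, nothing to do with the paper) is the logistic map: for small r the orbit settles into a short cycle, near r = 4 it is chaotic, and the richest structure sits near the onset of chaos around r ≈ 3.57. Counting distinct long-run values is a crude proxy for how much structure the dynamics can hold:

        def distinct_longrun_values(r, transient=2000, sample=2000, decimals=4):
            """Iterate x -> r*x*(1-x), discard the transient, count distinct values."""
            x = 0.5
            for _ in range(transient):
                x = r * x * (1 - x)
            seen = set()
            for _ in range(sample):
                x = r * x * (1 - x)
                seen.add(round(x, decimals))
            return len(seen)

        for r in (2.8, 3.2, 3.5, 3.57, 3.8, 4.0):
            print(f"r = {r:4.2f}  distinct long-run values ~ {distinct_longrun_values(r)}")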

      • soulofmischief 6 hours ago
        A scale-free network is one whose degree distribution follows a power law. [0]

        Self-organized criticality describes a phenomenon where certain complex systems naturally evolve toward a critical state where they exhibit power-law behavior and scale invariance. [1]

        The power laws observed in such systems suggest they are at the edge between order and chaos. In intelligent systems, such as the brain, this edge-of-chaos behavior is thought to enable maximal adaptability, information processing, and optimization.

        The brain has been proposed to operate near critical points, with neural avalanches following power laws. This allows a very small amount of energy to have an outsized impact, the key feature of scale-free networks. This phenomenon is a natural extension of the stationary action principle.

        [0] https://en.wikipedia.org/wiki/Scale-free_network

        [1] https://www.researchgate.net/publication/235741761_Self-Orga...
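
        For anyone who wants to poke at this, here is a minimal sketch (mine, not from either reference) of the Bak-Tang-Wiesenfeld sandpile, the textbook example of self-organized criticality: grains are dropped at random, any site holding 4 or more grains topples onto its neighbors, and the avalanche sizes come out roughly power-law distributed without tuning any parameter.

          import numpy as np

          rng = np.random.default_rng(0)
          L = 30
          grid = np.zeros((L, L), dtype=int)
          avalanche_sizes = []

          for _ in range(20000):
              i, j = rng.integers(0, L, size=2)
              grid[i, j] += 1                          # drop one grain at a random site
              size = 0
              while True:
                  unstable = np.argwhere(grid >= 4)
                  if len(unstable) == 0:
                      break
                  for a, b in unstable:
                      grid[a, b] -= 4                  # topple: shed 4 grains to neighbors
                      size += 1
                      for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                          na, nb = a + da, b + db
                          if 0 <= na < L and 0 <= nb < L:  # grains at the edge fall off
                              grid[na, nb] += 1
              if size:
                  avalanche_sizes.append(size)

          # Heavy-tail check: avalanche counts in logarithmic size bins decay
          # roughly as a power law rather than exponentially.
          sizes = np.array(avalanche_sizes)
          for lo, hi in [(1, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128)]:
              print(f"size in [{lo:3d}, {hi:3d}): {np.mean((sizes >= lo) & (sizes < hi)):.4f}")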

        • zburatorul 3 hours ago
          I can see how scale free systems have their action stay invariant under more transformations. I'd like to better understand the connection with action stationarity/extreme. Can you say more?
      • Agingcoder 9 hours ago
        This looks very interesting. Would you have references? (Not necessarily on grokking, but about the part where computation can occur only when the right balance is found.)
  • kouru225 10 hours ago
    And winner of Best Title of the Year goes to:
    • bbor 8 hours ago
      I'm glad I'm not the only one initially drawn in by the title! As the old meme goes:

      > If you can't describe your job in 3 Words, you have a BS job:

      > 1. "I catch fish" Real job!

      > 2. "I drive taxis" Real job!

      > 3. "I grok at the edge of linear separability" BS Job!

      • sva_ 6 hours ago
        > ai researcher
        • o11c an hour ago
          Amazingly, 2/5 LLMs I asked consistently (I only tested a few times) gave a reasonable answer (usually "two", but occasionally "three") and explanation for: How many words in "AI researcher?"

          "Four" is completely bogus no matter how you measure it, even if it's in a list of alternatives. Also the word "engineer" definitely isn't in there. "researcher" is present, and "er" isn't even a word!

          I'm of two minds on the one that explicitly argued (without prompting) that the question mark counts as a word. But it failed other times anyway.

  • bbor 8 hours ago
    Wow, fascinating stuff and "grokking" is news to me. Thanks for sharing! In typical HN fashion, I'd like to come in as an amateur and nitpick the terminology/philosophy choices of this nascent-yet-burgeoning subfield:

      We begin by examining the optimal generalizing solution, that indicates the network has properly learned the task... the network should put all points in Rd on the same side of the separating hyperplane, or in other words, push the decision boundary to infinity... Overfitting occurs when the hyperplane is only far enough from the data to correctly classify all the training samples.
    
    This is such a dumb idea at first glance; I'm so impressed that they pushed past that and used it for serious insights. It truly is a kind of atomic/fundamental/formalized/simplified way to explore overfitting on its own.

    Ultimately their thesis, as I understand it from the top of page 5, is roughly these two steps (with some slight rewording):

      [I.] We call a training set separable if there exists a vector [that divides the data, like a 2D vector from the origin dividing two sets of 2D points]... The training set is almost surely separable [when there's twice as many dimensions as there are points, and almost surely inseparable otherwise]... 
    
    Again, a dumb observation that's obvious in hindsight, which makes it all the more impressive that they found a use for it. This is how paradigm shifts happen! An alternate title for the paper could've been "A Vector Is All You Need (to understand grokking)". OK, but assuming I understood the setup right, here's the actual finding:

      [II.] [Given infinite training time,] the model will always overfit for separable training sets[, and] for inseparable training sets the model will always generalize perfectly. However, when the training set is on the verge of separability... dynamics may take arbitrarily long times to reach the generalizing solution [rather than overfitting]. 
      **This is the underlying mechanism of grokking in this setting**. 
    
    Or, in other words from Appendix B:

      grokking occurs near critical points in which solutions exchange stability and dynamics are generically slow
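
    To check my reading of the setup, here's a little sketch (mine, definitely not the authors' code): plain gradient descent on the logistic loss for n Gaussian points in d dimensions that all carry the label +1, watching whether the bias b or the weight norm ||w|| ends up doing the work. Whether you actually see the long overfitting plateau before generalization should depend on how close n/d is to the separability threshold, plus the learning rate and how long you're willing to wait.

      import numpy as np
      from scipy.special import expit          # numerically stable logistic sigmoid

      rng = np.random.default_rng(0)
      d, n = 100, 150                          # play with the n/d ratio here
      X = rng.standard_normal((n, d))          # training points, all labeled +1
      X_test = rng.standard_normal((10000, d)) # fresh points, also labeled +1

      w, b, lr = np.zeros(d), 0.0, 0.1

      for step in range(1, 200001):
          margins = X @ w + b
          g = -expit(-margins)                 # d(logistic loss)/d(margin) for label +1
          w -= lr * (X.T @ g) / n
          b -= lr * g.mean()
          if step % 20000 == 0:
              train_acc = np.mean(margins > 0)
              test_acc = np.mean(X_test @ w + b > 0)
              print(f"step {step:6d}  ||w|| = {np.linalg.norm(w):8.3f}  b = {b:8.3f}  "
                    f"train = {train_acc:.2f}  test = {test_acc:.2f}")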
      
    Assuming I understood that all correctly, this finally brings me to my philosophical critique of "grokking", which ends up being a complement to this paper: grokking is just a modal transition in algorithmic structure, which is exactly why it's seemingly related to topics as diverse as physical phase changes and the sudden appearance of large language models. I don't blame the statisticians for not recognizing it, but IMO they're capturing something far more fundamental than a behavioral quirk in some mathematical tool.

    Non-human animals (and maybe some really smart plants) obviously are capable of "learning" in some human-like way, but it rarely surpasses the basics of Pavlovian conditioning: they delineate quantitative objects in their perceptive field (as do unconscious particles when they mechanically interact with each other), computationally attach qualitative symbols to them based on experience (as do plants), and then calculate relations/groups of that data based on some evolutionarily-tuned algorithms (again, a capability I believe to be unique to animals and weird plants). Humans, on the other hand, not only perform calculations about our immediate environment, but also freely engage in meta-calculations -- this is why our smartest primate relatives are still incapable of posing questions, yet humans pose them naturally from an extremely young age.

    Details aside, my point is that different orders of cognition are different not just in some quantitative way, like an increase in linear efficiency, but rather in a qualitative--or, to use the hot lingo, emergent--way. In my non-credentialed opinion, this paper is a beautiful formalization of that phenomenon, even though it necessarily is stuck at the bottom of the stack so-to-speak, describing the switch in cognitive capacity from direct quantification to symbolic qualification.

    It's very possible I'm clouded by the need to confirm my priors, but if not, I hope this paper sees wide use among ML researchers as a clean, simplified exposition of what we're all really trying to do here on a fundamental level. A generalization of generalization, if you will!

    Alon, Noam, and Yohai, if you're in here, congrats for devising such a dumb paper that is all the more useful & insightful because of it. I'd love to hear your hot takes on the connections between grokking, cognition, and physics too, if you have any that didn't make the cut!

    • anigbrowl 8 hours ago
      It's just another garbage buzzword. We already have perfectly good words for this, like understanding and comprehension. The use of grokking is a form of in-group signaling to get buy-in from other Cool Kids Who Like Robert Heinlein, but it's so obviously a nerdspeak effort at branding that it's probably never going to catch on outside of that demographic, no matter how fetch it is.
      • kaibee 6 hours ago
        > It's just another garbage buzzword. We already have perfectly good words for this like understanding and comprehension.

        Yeah, try telling people that NNs contain actual understanding and comprehension. That won't be controversial at all.

        • anigbrowl 4 hours ago
          I'm fully aware that most people disagree with that idea, although I myself think we're not very far removed from LLMs at all, and that there's no fundamental barrier to machine consciousness.

          While that may be an unpopular opinion at present, and more so outside of the technical/academic worlds, trying to market the same idea by giving it a vaguely cool new name is asinine in my view. I don't see how it's any different from some entrepreneurially minded physicist trying to get attention by writing papers about magnetism but calling it 'The Force' instead to build a following of Star Wars fans.

          It's not that I dislike Heinlein or anything; I'm rather a fan, actually. But trying to juice up research with cool sci-fi references is cringe, and when I see it I reflexively discount the research claim because of the unpleasant feeling that it's a sales pitch in disguise.