Momentum, or more specifically inertia, in physics: what the hell is it? There's a Feynman tale where he asked his father why the ball rolled to the back of a trolley when he pulled the trolley. The answer he received was the usual description of inertia, but also the rarely given insight that describing something and giving it a name is completely different from knowing why it happens.
It's one of those things I lie in bed thinking about. The other one is position. I can grasp the notion of spacetime, and the idea of movement and speed as changes in position in space relative to position in time, but I really don't have a grasp of what position itself is. I know the name, and I can attach numbers to it, but that doesn't really cover what the numbers are measurements of.
Suppose you have a bunch of 2D points without coordinates. They exist because you say so; they can represent anything you want.
But you can't do a lot with those points, and you may be interested in knowing their distances. To do that, you create some reference system, i.e. two non-parallel axes, and you set a unit on each: for example, one could have one centimeter and the other one meter.
Now, by placing the reference system on one particular point, for example, you can 'identify' every other point on that scale.
With this correspondence, you can uniquely map each point to a coordinate and each coordinate to a point in space. This allows you to measure distances, for example.
Notice that the chosen coordinates didn't really matter, nor did the direction of the axes. But as the rest of the 2D world can be mapped to them, everything is coherent.
Now if you create a novel axis system with another initial point and 1 cm on each axis, you can find a transformation that turns your first system into the second; this transformation lets you express any point in the new coordinate system.
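To make this concrete, here's a toy numpy sketch (the specific origins, axes, and units below are invented for illustration): a point exists on its own, each frame just assigns it different numbers, and a change of frame is an invertible map between those numbers.

    import numpy as np

    # Two hypothetical frames, each an origin plus two non-parallel axis
    # vectors (columns). Frame A: axes of 1 cm and 1 m; frame B: 1 cm each.
    origin_a = np.array([2.0, 1.0])
    basis_a = np.column_stack([[0.01, 0.0], [0.0, 1.0]])   # meters per unit
    origin_b = np.array([0.0, 0.0])
    basis_b = np.column_stack([[0.01, 0.0], [0.0, 0.01]])

    def to_coords(p, origin, basis):
        """The numbers a frame assigns to a frame-independent point p."""
        return np.linalg.solve(basis, p - origin)

    def to_point(c, origin, basis):
        """Recover the point from its coordinates in a frame."""
        return origin + basis @ c

    p = np.array([2.5, 3.0])               # the point itself, no frame needed
    c_a = to_coords(p, origin_a, basis_a)  # its numbers in frame A
    c_b = to_coords(p, origin_b, basis_b)  # its numbers in frame B
    assert np.allclose(to_point(c_a, origin_a, basis_a),
                       to_point(c_b, origin_b, basis_b))  # same point either way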
So what is a position, exactly? I would say it's the identification of some object within an arbitrarily chosen reference system. What are the numbers? They correspond to an arbitrarily chosen unit of measure.
I hope this gives you more food for thought :)
I'm not sure about position. It's a hard one to think about. What's important is that a position's numbers (coordinates) are defined relative to a coordinate system, but the actual physical "position" doesn't care about the coordinate system. So things like distance, or the time it takes to get from one point to another (in some units), are invariant under coordinate changes.
    // Scaled 2D banana (Rosenbrock-like) function from the interactive demo.
    var s = 3                                    // scaling of the second coordinate
    var x = xy[0]; var y = xy[1]*s
    var fx = (1-x)*(1-x) + 20*(y - x*x)*(y - x*x)            // objective
    var dfx = [-2*(1-x) - 80*x*(y - x*x), s*40*(y - x*x)]    // gradient w.r.t. xy
The interactive example uses an initial guess of [-1.21, 0.853] and a fixed 150 iterations, with no convergence test. From manually fiddling with the (step-size) alpha and (momentum) beta parameters, and editing the code to specify a smaller number of iterations, it seems quite difficult to tune this momentum-based approach to get near the minimum and stay there without bouncing away in 50 iterations or fewer.
Out of curiosity, I compared minimising this bananaf function with scipy.optimize.minimize, using the same initial guess.
If we force scipy.optimize.minimize to use method='cg', leaving all other parameters as defaults, it converges to the optimal solution of [1.0, 1./3.], requiring 43 evaluations of fx and dfx.
If we allow scipy.optimize.minimize to use all defaults -- including the default method='bfgs' -- it converges to the optimal solution after only 34 evaluations of fx and dfx.
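For reference, a sketch of roughly what I ran -- the Python port of the JS snippet above is mine, the scipy calls are the standard API:

    import numpy as np
    from scipy.optimize import minimize

    S = 3.0  # same scaling of the second coordinate as the JS snippet

    def fx(xy):
        x, y = xy[0], xy[1] * S
        return (1 - x)**2 + 20 * (y - x*x)**2

    def dfx(xy):
        x, y = xy[0], xy[1] * S
        return np.array([-2*(1 - x) - 80*x*(y - x*x), S * 40 * (y - x*x)])

    x0 = np.array([-1.21, 0.853])
    for method in ('cg', 'bfgs'):
        res = minimize(fx, x0, jac=dfx, method=method)
        print(method, res.x, res.nfev, res.njev)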
Under the hood, scipy's method='cg' and method='bfgs' solvers do not use a fixed step size or momentum to determine the step size; instead they solve a line search problem: find a step size that satisfies a sufficient-decrease condition and a curvature condition -- see the Wolfe conditions [1]. Scipy's default line search method -- used for cg and bfgs -- is a python port [2] of the dcsrch routine from MINPACK2. A good reference covering line search methods & BFGS is Nocedal & Wright's 2006 book Numerical Optimization.
[1] https://en.wikipedia.org/wiki/Wolfe_conditions
[2] https://github.com/scipy/scipy/blob/main/scipy/optimize/_dcs...
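To get a feel for a single line-search step, here's a sketch using scipy's public scipy.optimize.line_search helper on the rosen test function (this public helper enforces the Wolfe conditions, though it may not be the exact dcsrch port the cg/bfgs solvers use internally):

    import numpy as np
    from scipy.optimize import line_search, rosen, rosen_der

    xk = np.array([-1.21, 0.853])     # current iterate
    pk = -rosen_der(xk)               # steepest-descent search direction
    # First element of the result is a step size satisfying the Wolfe
    # conditions (it is None if the search fails to find one).
    alpha = line_search(rosen, rosen_der, xk, pk)[0]
    print(alpha, rosen(xk + alpha * pk), rosen(xk))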
On the other hand, if the objective function is very multimodal with many "red herring" local minima, perhaps an optimiser that is very good at finding the nearest local minimum might do worse in practice at globally optimising than an optimiser that sometimes "barrels" out of the basin of a local minimum and accidentally falls into a neighbouring basin around a lower minimum.
I ran a few numerical experiments using scipy's "rosen" test function [1] as the objective, in D=10,000 dimensions. This function has a unique global minimum of 0, attained at x* = 1_D. I set the initial guess as x0 := x* + eps, where each element eps_i, i=1,...,D, is noise sampled from N(0, 0.05).
Repeating this over 100 trial problems, using the same initial guess x0 across each method during each trial, the average number of gradient evaluations required for convergence was:
'cg': 248
'l-bfgs-b': 40
'm-001-99': 3337
All methods converged in 100/100 trials. m-001-99 is gradient descent with momentum using alpha=0.001 and beta=0.99; setting alpha=0.002 or higher causes momentum to fail to converge. The other two methods are scipy's cg & l-bfgs-b methods using default parameters (again, under the hood these two methods rely on a port of MINPACK2's dcsrch to determine the step size along the descent direction during each iteration; they're not using momentum updates or a fixed step size). I used l-bfgs-b instead of bfgs to avoid maintaining the dense D*D matrix for the approximate inverse Hessian.
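For reference, a sketch of the m-001-99 baseline as described above -- gradient descent with momentum and a fixed step size; the gradient-norm convergence test and reading 0.05 as a standard deviation are my assumptions:

    import numpy as np
    from scipy.optimize import rosen, rosen_der

    def momentum_descent(x0, alpha=0.001, beta=0.99, tol=1e-5, max_iter=100_000):
        """Gradient descent with momentum; returns x and the gradient count."""
        x, v = x0.copy(), np.zeros_like(x0)
        for n_grad in range(1, max_iter + 1):
            g = rosen_der(x)
            if np.linalg.norm(g) < tol:   # assumed convergence test
                break
            v = beta * v - alpha * g      # momentum update
            x = x + v
        return x, n_grad

    D = 10_000
    rng = np.random.default_rng(0)
    x0 = np.ones(D) + rng.normal(0.0, 0.05, size=D)  # 0.05 read as std dev
    x, n_grad = momentum_descent(x0)
    print(n_grad, rosen(x))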
One point in momentum's favour was robustness to higher noise levels in the initial guess -- if the noise level used to define x0 is increased to N(0, 1), then the cg & l-bfgs-b methods fail to converge in around 20% of trial problems, while momentum fails a smaller fraction of the time provided the fixed step size is set small enough, though it still requires a very large number of gradient evaluations to converge.
[1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.o...
Why Momentum Works - https://news.ycombinator.com/item?id=14034426 - April 2017 (95 comments)
Gwern is correct in his earlier quote of how long these articles took. I think 50-200 hours is a pretty good range.
I expect AI assistants could help quite a bit with implementing the interactive diagrams, which was a significant fraction of this time. This is especially true for authors without a background in web development.
However, a huge amount of the editorial time went into other things. This article was a best-case scenario for an article not written by the editors themselves. Gabriel is phenomenal and was a delight to work with. The editors didn't write any code for this article that I remember, but we still spent many tens of hours giving feedback on the text and diagrams. You can see some of this on GitHub - e.g. https://github.com/distillpub/post--momentum/issues?q=is%3Ai...
More broadly, we struggled a lot with procedural issues. (We wrote a bit about this here: https://distill.pub/2021/distill-hiatus/ ) In retrospect, I deeply regret trying to run Distill with the expectations of a scientific journal rather than the freedom of a blog, and wish I'd pushed back more on process. Not only did it occupy enormous amounts of time and energy, but it was just very de-energizing. I wanted to spend my time writing great articles and helping people write great articles.
(I was recently reading Thompson & Klein's Abundance, and kept thinking back to my experiences with Distill.)