Momentum, or more specifically inertia, in physics: what the hell is it? There's a Feynman tale where he asked his father why the ball rolled to the back of a trolley when he pulled the trolley. The answer he received was the usual description of inertia, but also the rarely given insight that describing something and giving it a name is completely different from knowing why it happens.
It's one of those things I lie in bed thinking about. The other one is position. I can grasp the notion of spacetime, and the idea of movement and speed as changes in position in space relative to position in time, but I really don't have a grasp of what position itself is. I know the name, and I can attach numbers to it, but that doesn't really cover what the numbers are measurements of.
Suppose you have a bunch of 2D points without coordinates. They exist because you say so; they can represent anything you want.
But you can't do a lot with those points, and you may be interested in knowing their distances. To do that, you create some reference system, i.e. two non-parallel axes, and you set a unit on each: for example, one could have one centimeter and the other one meter.
Now, by placing the reference system on one particular point, for example, you can 'identify' every other point on that scale.
With this correspondence, you can uniquely map each point to a coordinate and each coordinate to a point in space. This allows you to measure distances, for example.
Notice that the chosen coordinates didn't really matter, nor did the direction of the axes. But as the rest of the 2D world can be mapped to them, everything is coherent.
Now if you create a novel axis system with another initial point and 1 cm on each axis, you can find a transformation that turns your first system into the second; this transformation lets you express any point in the new coordinate system.
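To make this concrete, here's a toy numpy sketch (the specific origins, axes, and units below are invented for illustration): a point exists on its own, each frame just assigns it different numbers, and a change of frame is an invertible map between those numbers.

    import numpy as np

    # Two hypothetical frames, each an origin plus two non-parallel axis
    # vectors (columns). Frame A: axes of 1 cm and 1 m; frame B: 1 cm each.
    origin_a = np.array([2.0, 1.0])
    basis_a = np.column_stack([[0.01, 0.0], [0.0, 1.0]])   # meters per unit
    origin_b = np.array([0.0, 0.0])
    basis_b = np.column_stack([[0.01, 0.0], [0.0, 0.01]])

    def to_coords(p, origin, basis):
        """The numbers a frame assigns to a frame-independent point p."""
        return np.linalg.solve(basis, p - origin)

    def to_point(c, origin, basis):
        """Recover the point from its coordinates in a frame."""
        return origin + basis @ c

    p = np.array([2.5, 3.0])               # the point itself, no frame needed
    c_a = to_coords(p, origin_a, basis_a)  # its numbers in frame A
    c_b = to_coords(p, origin_b, basis_b)  # its numbers in frame B
    assert np.allclose(to_point(c_a, origin_a, basis_a),
                       to_point(c_b, origin_b, basis_b))  # same point either way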
So what is a position, exactly? I would say it's the identification of some object within an arbitrarily chosen reference system. What are the numbers? They correspond to an arbitrarily chosen unit of measure.
I hope this gives you more food for thought :)
I'm not sure about position. It's a hard one to think about. What's important is that a position's numbers (coordinates) are defined relative to a coordinate system, but the actual physical "position" doesn't care about the coordinate system. So things like distance, or the time it takes to get from one point to another (in some units), are invariant under coordinate changes.
    // Scaled 2D banana (Rosenbrock-like) function from the interactive demo.
    var s = 3                                    // scaling of the second coordinate
    var x = xy[0]; var y = xy[1]*s
    var fx = (1-x)*(1-x) + 20*(y - x*x)*(y - x*x)            // objective
    var dfx = [-2*(1-x) - 80*x*(y - x*x), s*40*(y - x*x)]    // gradient w.r.t. xy
The interactive example uses an initial guess of [-1.21, 0.853] and a fixed 150 iterations, with no convergence test. From manually fiddling with the (step-size) alpha and (momentum) beta parameters, and editing the code to specify a smaller number of iterations, it seems quite difficult to tune this momentum-based approach to get near the minimum and stay there without bouncing away in 50 iterations or fewer.
Out of curiosity, I compared minimising this bananaf function with scipy.optimize.minimize, using the same initial guess.
If we force scipy.optimize.minimize to use method='cg', leaving all other parameters as defaults, it converges to the optimal solution of [1.0, 1./3.], requiring 43 evaluations of fx and dfx.
If we allow scipy.optimize.minimize to use all defaults -- including the default method='bfgs' -- it converges to the optimal solution after only 34 evaluations of fx and dfx.
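For reference, a sketch of roughly what I ran -- the Python port of the JS snippet above is mine, the scipy calls are the standard API:

    import numpy as np
    from scipy.optimize import minimize

    S = 3.0  # same scaling of the second coordinate as the JS snippet

    def fx(xy):
        x, y = xy[0], xy[1] * S
        return (1 - x)**2 + 20 * (y - x*x)**2

    def dfx(xy):
        x, y = xy[0], xy[1] * S
        return np.array([-2*(1 - x) - 80*x*(y - x*x), S * 40 * (y - x*x)])

    x0 = np.array([-1.21, 0.853])
    for method in ('cg', 'bfgs'):
        res = minimize(fx, x0, jac=dfx, method=method)
        print(method, res.x, res.nfev, res.njev)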
Under the hood, scipy's method='cg' and method='bfgs' solvers do not use a fixed step size or momentum to determine the step size; instead they solve a line search problem: find a step size that satisfies a sufficient-decrease condition and a curvature condition -- see the Wolfe conditions [1]. Scipy's default line search method -- used for cg and bfgs -- is a python port [2] of the dcsrch routine from MINPACK2. A good reference covering line search methods & BFGS is Nocedal & Wright's 2006 book Numerical Optimization.
[1] https://en.wikipedia.org/wiki/Wolfe_conditions
[2] https://github.com/scipy/scipy/blob/main/scipy/optimize/_dcs...
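To get a feel for a single line-search step, here's a sketch using scipy's public scipy.optimize.line_search helper on the rosen test function (this public helper enforces the Wolfe conditions, though it may not be the exact dcsrch port the cg/bfgs solvers use internally):

    import numpy as np
    from scipy.optimize import line_search, rosen, rosen_der

    xk = np.array([-1.21, 0.853])     # current iterate
    pk = -rosen_der(xk)               # steepest-descent search direction
    # First element of the result is a step size satisfying the Wolfe
    # conditions (it is None if the search fails to find one).
    alpha = line_search(rosen, rosen_der, xk, pk)[0]
    print(alpha, rosen(xk + alpha * pk), rosen(xk))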
On the other hand, if the objective function is very multimodal with many "red herring" local minima, perhaps an optimiser that is very good at finding the nearest local minimum might do worse in practice at globally optimising than an optimiser that sometimes "barrels" out of the basin of a local minimum and accidentally falls into a neighbouring basin around a lower minimum.
I ran a few numerical experiments using scipy's "rosen" test function [1] as the objective, in D=10,000 dimensions. This function has a unique global minimum of 0, attained at x* = 1_D. I set the initial guess as x0 := x* + eps, where each element eps_i, i=1,...,D, is noise sampled from N(0, 0.05).
Repeating this over 100 trial problems, using the same initial guess x0 across each method during each trial, the average number of gradient evaluations required for convergence was:
'cg': 248
'l-bfgs-b': 40
'm-001-99': 3337
All methods converged in 100/100 trials. m-001-99 is gradient descent with momentum using alpha=0.001 and beta=0.99; setting alpha=0.002 or higher causes momentum to fail to converge. The other two methods are scipy's cg & l-bfgs-b methods using default parameters (again, under the hood these two methods rely on a port of MINPACK2's dcsrch to determine the step size along the descent direction during each iteration; they're not using momentum updates or a fixed step size). I used l-bfgs-b instead of bfgs to avoid maintaining the dense D*D matrix for the approximate inverse Hessian.
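For reference, a sketch of the m-001-99 baseline as described above -- gradient descent with momentum and a fixed step size; the gradient-norm convergence test and reading 0.05 as a standard deviation are my assumptions:

    import numpy as np
    from scipy.optimize import rosen, rosen_der

    def momentum_descent(x0, alpha=0.001, beta=0.99, tol=1e-5, max_iter=100_000):
        """Gradient descent with momentum; returns x and the gradient count."""
        x, v = x0.copy(), np.zeros_like(x0)
        for n_grad in range(1, max_iter + 1):
            g = rosen_der(x)
            if np.linalg.norm(g) < tol:   # assumed convergence test
                break
            v = beta * v - alpha * g      # momentum update
            x = x + v
        return x, n_grad

    D = 10_000
    rng = np.random.default_rng(0)
    x0 = np.ones(D) + rng.normal(0.0, 0.05, size=D)  # 0.05 read as std dev
    x, n_grad = momentum_descent(x0)
    print(n_grad, rosen(x))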
One point in momentum's favour was robustness to higher noise levels in the initial guess -- if the noise level used to define x0 is increased to N(0, 1), then the cg & l-bfgs-b methods fail to converge in around 20% of trial problems, while momentum fails a smaller fraction of the time provided the fixed step size is set small enough, though it still requires a very large number of gradient evaluations to converge.
[1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.o...
Why Momentum Works - https://news.ycombinator.com/item?id=14034426 - April 2017 (95 comments)
Gwern is correct in his earlier quote of how long these articles took. I think 50-200 hours is a pretty good range.
I expect AI assistants could help quite a bit with implementing the interactive diagrams, which was a significant fraction of this time. This is especially true for authors without a background in web development.
However, a huge amount of the editorial time went into other things. This article was a best-case scenario for an article not written by the editors themselves. Gabriel is phenomenal and was a delight to work with. The editors didn't write any code for this article that I remember, but we still spent many tens of hours giving feedback on the text and diagrams. You can see some of this on GitHub - e.g. https://github.com/distillpub/post--momentum/issues?q=is%3Ai...
More broadly, we struggled a lot with procedural issues. (We wrote a bit about this here: https://distill.pub/2021/distill-hiatus/ ) In retrospect, I deeply regret trying to run Distill with the expectations of a scientific journal rather than the freedom of a blog, and wish I'd pushed back more on process. Not only did it occupy enormous amounts of time and energy, but it was just very de-energizing. I wanted to spend my time writing great articles and helping people write great articles.
(I was recently reading Thompson & Klein's Abundance, and kept thinking back to my experiences with Distill.)