> In particular, if x is a training example and L(x) is the per-example loss for the training example x, then this vector field is: v^(x)(θ) = -∇_θ L(x). In other words, for a specific training example, the arrows of the resulting vector field point in the direction that the parameters should be updated.
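For concreteness, here's a toy sketch of that per-example field (the model, loss, and numbers are my own, not from the quoted post): for a one-parameter least-squares model, each training example induces its own arrow at every point in parameter space.

```python
import numpy as np

# Toy illustration: a 1-parameter model f(x) = theta * x with squared loss
# L(x) = (theta * x - y)^2. The per-example vector field is
# v^(x)(theta) = -dL/dtheta = -2 * (theta * x - y) * x.
def per_example_field(theta, x, y):
    return -2.0 * (theta * x - y) * x

# Two different examples induce two different vector fields over theta-space.
thetas = np.linspace(-2.0, 2.0, 5)
v1 = per_example_field(thetas, x=1.0, y=1.0)   # arrows point toward theta = 1
v2 = per_example_field(thetas, x=2.0, y=-2.0)  # arrows point toward theta = -1
```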
But for the MXResNet example:
> The optimizer is Adam, with the following parameters: lr = 5e-3, betas = (0.8, 0.999)
This changes the direction of the updates, so I'm not sure the intuitive equivalence still holds.
If it were just SGD with momentum, then the measured update directions would combine the momentum vector M with the per-example gradients v1 and v2, so by bilinearity and {M, M} = 0 we'd get {M + v1, M + v2} = {v1, M} + {M, v2} + {v1, v2}. The Lie bracket is no longer "just" a function of the model parameters and the training examples; it's now inherently path-dependent, because M encodes the history of previous updates.
For Adam, the parameter-wise normalization by the second-moment estimate (the β2 term) will also change the directions of the updates in a nonlinear way.
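A quick sketch of that effect (one bias-corrected Adam step following Kingma & Ba's update rule; the gradient values are made up): the per-parameter division by √v̂ rescales components unequally, so the step is no longer parallel to the raw gradient.

```python
import numpy as np

# One bias-corrected Adam step from zero state, returning its unit direction.
def adam_direction(g, m, v, beta1=0.8, beta2=0.999, eps=1e-8, t=1):
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2       # second-moment estimate
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    step = m_hat / (np.sqrt(v_hat) + eps)    # parameter-wise normalization
    return step / np.linalg.norm(step)

g = np.array([3.0, 0.1])                     # anisotropic gradient
d_adam = adam_direction(g, m=np.zeros(2), v=np.zeros(2))
d_sgd = g / np.linalg.norm(g)                # plain SGD follows the gradient
cos = float(d_adam @ d_sgd)                  # < 1: the directions differ
```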
The interpretation is also strained with fancier optimizers like Muon, which uses both momentum and an (approximate) SVD-based normalization of the update, so I'm really not sure what to expect.
Hmm.
{a + b, c + d} = {a, c + d} + {b, c + d} = {a, c} + {a, d} + {b, c} + {b, d}.
{a + b + c, x + y + z} = {a, x + y + z} + {b, x + y + z} + {c, x + y + z} = (a sum of nine direct brackets).
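The expansion is at least easy to sanity-check numerically: for linear vector fields u(x) = Ax and v(x) = Bx, the Lie bracket reduces to a matrix commutator, so the nine-term sum can be verified directly (toy random matrices, my own construction):

```python
import numpy as np

rng = np.random.default_rng(0)

# For linear vector fields u(x) = A x and v(x) = B x, the Lie bracket is
# [u, v](x) = (B A - A B) x, so we can work with the matrices alone.
def bracket(A, B):
    return B @ A - A @ B

a, b, c, x, y, z = (rng.standard_normal((3, 3)) for _ in range(6))

lhs = bracket(a + b + c, x + y + z)
# Bilinearity: the single bracket equals the sum of nine pairwise brackets.
rhs = sum(bracket(p, q) for p in (a, b, c) for q in (x, y, z))
```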
This doesn't look like it will scale well.
One of Andrew Gelman's favorite points to make about science 'as practiced' is that researchers fail to behave this way. There's a gigantic bias in favor of whatever information is published first.
This is the key insight behind the DQN algorithm's replay buffer: rather than feeding in training examples as they arrive, which would leave them strongly temporally correlated and destabilize learning, it samples randomly from the buffer.
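A minimal sketch of that idea (an assumed toy shape, not DQN's actual implementation): transitions arrive in temporal order, but training batches are drawn uniformly at random, breaking the correlation between consecutive experiences.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries are evicted

    def push(self, transition):
        self.buffer.append(transition)        # store in arrival order

    def sample(self, batch_size):
        # Uniform random sampling decorrelates the training batch.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(500):          # a strongly correlated stream: 0, 1, 2, ...
    buf.push(t)
batch = buf.sample(8)         # a decorrelated batch of recent transitions
```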
An easy way to wreck most ML models is to feed in the examples in a correlated order. For example, in a vision system that distinguishes cats from dogs, first feed in all the cats. Even worse, order the cats so there are minimal changes from one to the next: all the white cats first, each time picking the cat most similar to the previous one. That model will fail.