Damn, this is a strikingly simple modification. Modern deep learning optimizers typically compute each step's weight update using some combination of momentum and/or learning-rate scaling based on the running variance of the gradients. That means the "instantaneous" gradient from a particular backward pass can point in a different direction than the update the optimizer actually applies. The change the authors propose is to simply zero out, coordinate-wise, any parameter updates whose sign is opposite to the current gradient from the most recent backward pass. They're essentially saying "only apply the long-term stabilized update where it agrees with the current 'instantaneous' gradient." They show that this simple change significantly speeds up model training.
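For concreteness, here's a minimal PyTorch sketch of how such a sign-agreement mask could be bolted onto an Adam-style update. To be clear, this is my own illustration of the idea, not the authors' reference implementation: the function name `cautious_step`, the hyperparameters, and the toy loss are all assumptions for the sake of the example.

```python
import torch

def cautious_step(param, grad, exp_avg, exp_avg_sq, step,
                  lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Hypothetical Adam-style step with a sign-agreement mask (illustrative)."""
    beta1, beta2 = betas
    # Standard Adam running moments.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    # Bias-corrected update the optimizer would normally apply.
    update = (exp_avg / (1 - beta1 ** step)) / (
        (exp_avg_sq / (1 - beta2 ** step)).sqrt() + eps
    )
    # The proposed tweak: keep only the coordinates where the stabilized
    # update agrees in sign with the instantaneous gradient.
    mask = (update * grad > 0).to(update.dtype)
    param.add_(update * mask, alpha=-lr)

# Toy usage: minimize ||w||^2, passing the gradient in explicitly.
torch.manual_seed(0)
w = torch.randn(5)
m, v = torch.zeros_like(w), torch.zeros_like(w)
for t in range(1, 101):
    g = 2 * w  # gradient of the toy quadratic loss
    cautious_step(w, g, m, v, step=t)
```

The masked coordinates just sit out that step; their momentum and variance state still update as usual, so they rejoin as soon as the signs realign.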
I'm pretty intrigued by this, but will, as usual, wait for independent replications to come out before I fully believe it. That said, because of how simple this is, I'd expect such replications to happen within 24 hours. Exciting work!