Entangled SGD optimizer

In PyTorch the optim.SGD update rule (with momentum and weight decay enabled, no dampening or Nesterov) has the form:

v_{t+1} = beta * v_t + g_t + mu * theta_t
theta_{t+1} = theta_t - lr * v_{t+1}

where g_t is the gradient of the loss at theta_t.

Is there any reason why it is common practice to couple the learning rate with the momentum (beta) and the weight decay (mu)? As it stands, I can't increase the momentum (the smoothing of the gradients) without also increasing the magnitude of the velocity v, i.e. the effective step size: because the velocity is a running sum rather than an average of past gradients, its steady-state magnitude scales roughly as 1 / (1 - beta).
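
To make the coupling concrete, here is a throwaway sanity check (single scalar parameter, constant gradient of 1, otherwise arbitrary values) showing that the per-step displacement of optim.SGD approaches lr / (1 - beta) rather than lr:

```python
import torch

lr = 0.1
for beta in (0.0, 0.9, 0.99):
    p = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([p], lr=lr, momentum=beta)
    for _ in range(2000):          # long enough for the velocity to saturate
        p.grad = torch.ones(1)     # constant gradient g = 1
        opt.step()
    before = p.detach().clone()
    p.grad = torch.ones(1)
    opt.step()                     # one more step to measure the displacement
    step = (before - p).item()
    print(f"beta={beta}: step = {step:.3f}, lr / (1 - beta) = {lr / (1 - beta):.3f}")
```

The same thing happens to the weight decay term: it is folded into the gradient and hence into the velocity, so the shrinkage actually applied per step is roughly lr * mu / (1 - beta) rather than mu.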

Is there a good reason why PyTorch hasn't opted for a more 'disentangled' update rule along these lines:

v_{t+1} = beta * v_t + (1 - beta) * g_t
theta_{t+1} = (1 - mu) * theta_t - lr * v_{t+1}
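
Here is a rough sketch of that rule as a drop-in optimizer (just a minimal illustration with a made-up class name; no Nesterov or dampening options):

```python
import torch
from torch.optim import Optimizer


class DisentangledSGD(Optimizer):
    """Sketch of the decoupled rule above (hypothetical, not part of torch):
    the velocity is an exponential moving average of the gradients and the
    weight decay acts on the parameters directly, so beta no longer rescales
    the effective step size and mu is not multiplied by the learning rate."""

    def __init__(self, params, lr=0.1, beta=0.9, mu=0.0):
        super().__init__(params, dict(lr=lr, beta=beta, mu=mu))

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, beta, mu = group["lr"], group["beta"], group["mu"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "v" not in state:
                    state["v"] = torch.zeros_like(p)
                v = state["v"]
                v.mul_(beta).add_(p.grad, alpha=1 - beta)  # EMA of the gradients
                p.mul_(1 - mu)                             # decoupled weight decay
                p.add_(v, alpha=-lr)                       # step of size ~ lr * |g|
        return loss
```

It drops in wherever optim.SGD is used, e.g. DisentangledSGD(model.parameters(), lr=1.0, beta=0.9, mu=5e-4).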

I seem to be getting similar performance on CIFAR-10 with it, although it requires different values of mu and beta.
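
The two parameterizations can be matched roughly in steady state (treating the gradient and the weights as approximately constant over the averaging window, so this is only a heuristic for picking starting values):

lr_dis ≈ lr / (1 - beta)
mu_dis ≈ lr * mu / (1 - beta)

e.g. lr = 0.1, beta = 0.9, mu = 5e-4 in the coupled form lands at roughly lr_dis = 1.0 and mu_dis = 5e-4 in the decoupled one (the numbers here are just an illustration).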