Over the past few days I tried to rebuild an autograd engine in the style of PyTorch.
All of the computed gradients actually match PyTorch's gradients, so that part should be fine.
Now I've decided to build a little NN library on top of my autograd engine to train some small networks.
My problem is that, due to operations like the .mean() at the end, the gradients get scaled down to almost nothing, so it takes a huge number of optimizer steps to learn anything. So my question is:
does PyTorch somehow scale gradients when optimizing parameters, and if so, how?
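To make the effect concrete, here is a small sketch (plain NumPy, not the actual engine from this thread) of why .mean() shrinks gradients: its backward distributes the upstream gradient equally, scaling each element by 1/N.

```python
import numpy as np

def mean_backward(upstream, n):
    # d(mean(x))/dx_i = 1/n, so each element receives
    # upstream * 1/n -- tiny when the batch/element count is large.
    return np.full(n, upstream / n)

grad = mean_backward(1.0, 1000)
# each of the 1000 elements gets a gradient of 0.001,
# while the gradients still sum to the upstream value of 1.0
```

This 1/N factor is expected behavior, not something an optimizer is supposed to undo.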
have a nice day
I compared my linear layer's gradients (with similar weights and bias) against torch's, and they were again the
same, so the issue should be somewhere in the optimizer.
No, PyTorch doesn't scale the gradients (unless you are using mixed-precision training with the
GradScaler, but even then the gradients are unscaled before the update, so you can skip this side note). You could take a look at the
SGD implementation here and here and compare it against yours.
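For reference, a minimal sketch of what a plain SGD step does (no momentum, no weight decay) — note there is no gradient rescaling anywhere in the update:

```python
import numpy as np

def sgd_step(params, grads, lr=0.1):
    # Plain SGD: p <- p - lr * grad(p). The gradient is used as-is;
    # any 1/N factor from .mean() is already baked into it.
    return [p - lr * g for p, g in zip(params, grads)]

w = np.array([1.0, 2.0])
g = np.array([0.5, -0.5])
(w_new,) = sgd_step([w], [g])
# w_new == [0.95, 2.05]
```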
Thank you very much for your reply!
In the end it was just a silly broadcasting error: because the bias is broadcast over the batch, I have to sum the bias gradients over the
batch dimension, but I was summing the weight gradients as well.
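For anyone hitting the same bug, here is a hypothetical sketch (NumPy, not the original code from this thread) of a linear layer backward pass showing where the sum belongs: the bias gradient needs an explicit sum over the batch (to undo the forward broadcast), while the weight gradient already accumulates over the batch through the matmul, so summing it again inflates it by the batch size.

```python
import numpy as np

def linear_backward(x, grad_out, W):
    # x: (batch, in), grad_out: (batch, out), W: (out, in)
    grad_W = grad_out.T @ x         # matmul already sums over the batch
    grad_b = grad_out.sum(axis=0)   # explicit sum undoes the bias broadcast
    grad_x = grad_out @ W
    return grad_x, grad_W, grad_b
```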
have a nice day!