I have a very large network: a Siamese architecture with 30,262,656 parameters. I've been trying for a long time to train it without success. The loss starts at 1 and stays more or less constant.
When I started printing the gradients, everything began to work: the loss dropped toward zero and the network started to learn.
I suspect that zero_grad() deletes all the gradients without checking whether step() has actually been called.
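For reference, optimizer.zero_grad() does clear the .grad attributes unconditionally; it never checks whether step() ran. The usual fix is to make the ordering explicit: zero the gradients, then backward(), then step(), once per batch. A minimal sketch of that loop (using a stand-in Linear model, not your Siamese network):

```python
import torch

# Stand-in model and data; replace with your Siamese network and batches.
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)
y = torch.randn(8, 1)

for _ in range(3):
    opt.zero_grad()                                   # clear gradients from the previous iteration
    loss = torch.nn.functional.mse_loss(model(x), y)  # forward pass
    loss.backward()                                   # accumulate fresh gradients into .grad
    opt.step()                                        # update parameters using those gradients
```

If backward() is ever skipped between zero_grad() and step(), step() runs on zero (or missing) gradients and the loss sits still, which looks like the flat-loss behaviour described above.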
I would appreciate it if you could take a look.