I have a very large network: a Siamese architecture with 30,262,656 parameters. I've been trying for a long time to train it without success. The loss starts at 1 and stays more or less constant.
When I started printing the gradients, everything began to work: the loss dropped toward zero and the network started to learn.
I suspect that zero_grad() deletes all the gradients without checking whether step() has actually been called.
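For reference, optimizer.zero_grad() does clear the .grad attributes unconditionally; it never checks whether step() ran. The usual fix is to make the ordering explicit: zero the gradients, then backward(), then step(), once per batch. A minimal sketch of that loop (using a stand-in Linear model, not your Siamese network):

```python
import torch

# Stand-in model and data; replace with your Siamese network and batches.
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)
y = torch.randn(8, 1)

for _ in range(3):
    opt.zero_grad()                                   # clear gradients from the previous iteration
    loss = torch.nn.functional.mse_loss(model(x), y)  # forward pass
    loss.backward()                                   # accumulate fresh gradients into .grad
    opt.step()                                        # update parameters using those gradients
```

If backward() is ever skipped between zero_grad() and step(), step() runs on zero (or missing) gradients and the loss sits still, which looks like the flat-loss behaviour described above.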
I would appreciate it if you could take a look.