Thanks for your help. I managed to make the network work by changing the optimizer: I was using SGD, but the loss was not changing; when I switched to an Adam optimizer, the loss started decreasing after 10 epochs.
import torch.optim as optim

# Initial optimizer (loss plateaued):
initial_optimizer = optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-6)
# Final optimizer (loss started decreasing):
final_optimizer = optim.Adam(model.parameters(),
                             lr=LEARNING_RATE,  # args.learning_rate - default is 5e-5, our notebook had 2e-5
                             eps=1e-8)          # args.adam_epsilon - default is 1e-8
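For context, this is roughly how the optimizer is used in my training loop; train_loader and criterion here are placeholders for my actual data loader and loss, not the exact code I ran:

model.train()
for inputs, targets in train_loader:
    final_optimizer.zero_grad()        # clear gradients from the previous step
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()                    # backpropagate
    final_optimizer.step()             # Adam update of all model parameters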
I still don’t understand what the problem with SGD was, since SGD’s fluctuation should, in principle, enable it to jump to new and potentially better local minima.
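To make the comparison concrete, here is a rough scalar sketch of the two update rules as I understand them (the real PyTorch implementations also handle momentum, weight decay, and other details):

import math

def sgd_step(theta, grad, lr=1e-3):
    # Plain SGD: a single global step size; the step is just lr * grad,
    # so small or badly scaled gradients translate directly into tiny steps.
    return theta - lr * grad

def adam_step(theta, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: running estimates of the gradient mean (m) and uncentered
    # variance (v) give each parameter its own effective step size,
    # lr * m_hat / (sqrt(v_hat) + eps).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

So my question is whether the constant learning rate in plain SGD, rather than the fluctuation itself, is what kept the loss from moving in my case.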