In the following code, I am using a neural network (net) to minimize the expectation of a complex stochastic function (complex_function).

loss = torch.tensor(0., requires_grad=True)
params = list(net.parameters())
optimizer = torch.optim.SGD(params, lr=learning_rate)
nb_train = 100
nb_mean = 1000
for i in range(nb_train):
    for j in range(nb_mean):
        value = complex_function(net)
        loss = loss + value
    loss = loss / nb_mean
    loss.backward(retain_graph=True)
    optimizer.step()
    optimizer.zero_grad()
    if i == 0:
        best_loss = loss
        best_net = copy.deepcopy(net)
    else:
        if loss < best_loss:
            best_net = copy.deepcopy(net)
            best_loss = loss
return best_net

Weirdly, this code does not return the best net that I have encountered. Could someone explain why? I suppose it is related to copy.deepcopy(net), but I do not know how…

How did you verify that best_net is not the last copied net? Did you print any parameters from net inside the if loss < best_loss branch and compare them to the final best_net parameters?
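As a sketch of such a check (using a small hypothetical nn.Linear as a stand-in for the actual net), the two state_dicts can be compared tensor by tensor; after an optimizer step on net, the deep copy should no longer match:

```python
import copy

import torch
import torch.nn as nn

# Hypothetical small net standing in for the user's `net`
net = nn.Linear(4, 2)
best_net = copy.deepcopy(net)

def state_dicts_equal(a, b):
    # Compare every parameter tensor in the two state_dicts
    return all(torch.equal(a[k], b[k]) for k in a)

print(state_dicts_equal(net.state_dict(), best_net.state_dict()))  # True right after the copy

# After an optimizer step on `net`, the deep copy should no longer match
opt = torch.optim.SGD(net.parameters(), lr=0.1)
net(torch.ones(1, 4)).sum().backward()
opt.step()
print(state_dicts_equal(net.state_dict(), best_net.state_dict()))  # False
```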

I have compared the parameters of the nets using state_dict(), and the nets are indeed the same. But… I don’t understand why the loss I get when using this net after executing this code is lower than the one I printed previously. Does it still converge to a better net even though I do not ask it to do so?

loss = torch.tensor(0., requires_grad=True)
params = list(net.parameters())
optimizer = torch.optim.SGD(params, lr=learning_rate)
nb_train = 100
nb_mean = 1000
for i in range(nb_train):
    for j in range(nb_mean):
        value = complex_function(net)
        loss = loss + value
    loss = loss / nb_mean
    loss.backward(retain_graph=True)
    optimizer.step()
    optimizer.zero_grad()
    if i == 0:
        best_loss = loss
        best_net = copy.deepcopy(net)
    else:
        if loss < best_loss:
            best_net = copy.deepcopy(net)
            best_loss = loss
            print(best_loss)  # here!
return best_net

It depends on what you are doing afterwards and how you are calculating the loss.
So currently it seems the best model is indeed saved and deepcopy works as expected. However, the loss calculated with the stored best_net seems to be “better” than the one from the original net that was used as the reference to create best_net.
Could you explain how you are comparing the losses? I.e., are you using the same data loader, are you calling model.eval() in both runs, are you disabling random data augmentation, etc.?
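As a sketch of what a fair comparison could look like (assuming, hypothetically, that complex_function draws its randomness from torch’s global RNG), seeding the generator before each evaluation makes both models see the same simulated series:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the user's net and stochastic objective
net_a = nn.Linear(3, 1)
net_b = nn.Linear(3, 1)

def complex_function(net):
    # Stochastic objective: random input, squared output (an assumption for illustration)
    x = torch.randn(1, 3)
    return net(x).pow(2).mean()

def evaluate(net, nb_mean=100, seed=0):
    torch.manual_seed(seed)  # same random series for every call
    with torch.no_grad():
        return sum(complex_function(net) for _ in range(nb_mean)) / nb_mean

# With a fixed seed, repeated evaluations of the same net agree exactly,
# so any difference between two nets is due to the nets, not the noise.
loss_a, loss_b = evaluate(net_a), evaluate(net_b)
```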

To compute the loss, I have to randomly generate many time series and average over them. But the difference I described above is systematically positive, and so large that it is practically impossible for it to be due to noise alone.
Besides, I am not dealing with images, so I don’t think data augmentation applies to me.

Moreover, I was not using model.eval() since I don’t use BatchNorm or dropout. So I have just added it at the beginning of my code, and for the evaluation part I have added

with torch.no_grad():
    net.eval()

before evaluating the result.
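For reference, the usual pattern calls eval() once up front and wraps only the forward passes in no_grad(); a sketch with a hypothetical net:

```python
import torch
import torch.nn as nn

# Hypothetical net; eval() is a no-op here since there is no BatchNorm/Dropout
net = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))

net.eval()                 # switch BatchNorm/Dropout layers to eval mode
with torch.no_grad():      # disable autograd bookkeeping during evaluation
    out = net(torch.ones(1, 2))
# `out` carries no graph, so it cannot leak memory or gradients
```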

But again there is a big difference. What do you think?

You’ve already verified that the state_dicts are equal, so the input data would be the next part to check.
I would check the data next and start with a simple pre-defined tensor, such as torch.ones, store the results, and compare them. If both models return the same result, your data generation might be wrong.
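A minimal version of that check might look like this (a sketch, assuming the model accepts a plain 2D tensor; the nets here are hypothetical stand-ins):

```python
import copy

import torch
import torch.nn as nn

net = nn.Linear(5, 1)            # stand-in for the trained net
best_net = copy.deepcopy(net)    # stand-in for the stored best_net

x = torch.ones(1, 5)             # fixed, non-random probe input
with torch.no_grad():
    out_net = net(x)
    out_best = best_net(x)

# If the state_dicts really match, the outputs on a fixed input match too;
# any remaining loss difference must then come from the data generation.
print(torch.allclose(out_net, out_best))  # True
```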