I know that my issue may look similar to other questions about repeatability of results, but I will try to defend the claim that it is different. Why? Let me explain:
I use Optuna to optimize hyperparameters. Let’s consider this simplified objective function:
import random

import numpy as np
import optuna
import torch

# For repeatability
torch.manual_seed(7)
random.seed(7)
np.random.seed(7)
optuna_seed = optuna.samplers.TPESampler(seed=10)

def objective(trial: optuna.trial.Trial):
    optimizer_name = trial.suggest_categorical("optimizer_name", fun_train_h)
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    ANN_instance = model_func(optimizer_name, lr, train_dataloader, val_dataloader)
    accuracy_val = loss_eval(ANN_instance, val_dataloader)
    filename = f"/trial_{trial.number}-loss={accuracy_val:.5f}.pth"
    torch.save(ANN_instance, filename)
    return accuracy_val
So the algorithm determines the best optimizer type from the list and the best value of the learning rate. It also saves every trained model to a file, so I can reload it later (I will skip determining the best file for clarity; it is the same script as the optimization):
ANN_instance = torch.load(best_checkpoint_path)
ANN_instance.eval()
accuracy_train = loss_eval(ANN_instance, train_dataloader) * 100
accuracy_val = loss_eval(ANN_instance, val_dataloader) * 100
accuracy_test = loss_eval(ANN_instance, test_dataloader) * 100
It gives me results:
Accuracy over Train: 29.73
Accuracy over Validation: 20.51
Accuracy over Test: 18.81
A bit poor, but it is just a quick-and-dirty test, so never mind. I can rerun the file multiple times and always get the same result; it is fully deterministic. This is the end of File1.
Now in File2 I want to repeat the training process to record a few additional metrics (not implemented yet), so I don't reload the checkpoint; I start training again, making sure the random seeds are fixed. The same hyperparameters, the very same dataloaders:
# For repeatability
torch.manual_seed(7)
random.seed(7)
np.random.seed(7)

# Results from the previous file
optimizer_name = "AdamW"
lr = 0.0005

ANN_instance = model_func(optimizer_name, lr, train_dataloader, val_dataloader)
ANN_instance.eval()
accuracy_train = loss_eval(ANN_instance, train_dataloader) * 100
accuracy_val = loss_eval(ANN_instance, val_dataloader) * 100
accuracy_test = loss_eval(ANN_instance, test_dataloader) * 100
And I get:
Accuracy over Train: 15.05
Accuracy over Validation: 14.8
Accuracy over Test: 14.07
Again, I can rerun the file as many times as I want and always get the same result, but it is different from File1.
My question is: why?
Edit: I think I figured it out myself:
def objective(trial: optuna.trial.Trial):
    torch.manual_seed(7)
    random.seed(7)
    np.random.seed(7)
This fixes the issue. But why is the global seed not respected, so that I have to repeat the seeding inside the Optuna objective?
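My understanding of the likely mechanism: seeding fixes the RNG *state* once, but every random draw (sampler suggestions, weight initialization, shuffling) advances that state. So the best trial in File1 started from whatever state was left after all previous trials, while File2 starts from the freshly seeded state. Re-seeding at the top of the objective resets the state for every trial, which is why it restores repeatability. A minimal sketch with Python's built-in `random` module (the same logic applies to `torch.manual_seed` and `np.random.seed`):

```python
import random

# Seeding fixes the RNG state, but every draw advances it.
random.seed(7)
a = random.random()  # a draw made by "trial 0" (fresh seeded state)
b = random.random()  # "trial 1" now starts from an advanced state

random.seed(7)       # re-seeding at the start of each trial restores the state
c = random.random()  # identical to the first draw

print(a == c, a == b)
```

Here `a == c` but `a != b`: two runs only match if they start from the same RNG state *and* make the same number of draws in between.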