Performance decrease with `torch.manual_seed`

Hello! I am training a model on multiple nodes and GPUs using the DistributedDataParallel wrapper and torchrun. For reproducibility, I have the following function, which sets the seeds for several different random seed setters based on the value of a command-line argument (if no argument is passed, the seed is treated as -1):

import os, random
import torch

def set_seed(seed: int):
    if seed >= 0:
        os.environ["PYTHONHASHSEED"] = str(seed)
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:2"
        random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # duplicate: manual_seed already seeds all GPUs
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.enabled = False

When I provide a static seed (say 42 or 17), I get a test accuracy of around 0.63-0.66 on average. However, when I omit the seed argument (so the effective seed is -1), my test accuracy shoots up to an average of 0.87, with all other (hyper)parameters kept the same.

After further examination, I found that the culprit seems to be the torch.manual_seed(seed) line: omitting it while still providing a seed value gives a similar 0.87 test accuracy. Looking at the source code, it seems that torch.manual_seed(seed) already calls torch.cuda.manual_seed_all(seed), so my explicit call is a duplicate.

Any insight on why this may be happening would be much appreciated! Thanks in advance!

This would mean that your overall training is not stable and depends on the applied seed.
In case your training doesn’t take too much time, you could rerun it using some different seeds and check the success/failure rate of your model being able to converge.

Thanks for the quick response! I ran my model for 100 epochs, with validation every 10 epochs, using several seeds (0, 10, 20, 30, 40, 50), both with the torch.manual_seed(seed) line commented out and with it left in (so that line is the only difference in the entire procedure).

In both cases, the validation accuracies do seem to converge (some seeds faster than others), in the sense that the accuracy only ever differs by a couple of hundredths near the end. The only difference between them is the value to which they converge, which is around 0.81-0.85 (without torch.manual_seed) and 0.61-0.65 (with torch.manual_seed).

Does this mean that you have executed two different experiments which have converged to a different final accuracy?
If so, then I don’t think these are enough data points to point to an issue in seeding the code.

Yeah, I ran the two experiments (with and without torch.manual_seed, and with 100 vs. 200 epochs) multiple times to get a sense of the test accuracy. Do you have any suggestions for how to determine whether this is a seeding issue or not? The confusing part for me is why only torch.manual_seed contributes to this difference (and not all torch-related seed setters). I could run this experiment for more seed values, but maybe that isn't enough to prove anything.

Thanks again!

To check the overall stability of your training you could rerun the experiment multiple times using a different seed in each run to collect the failure rate. The number of experiments will also depend on the duration it takes to claim the model converged or the training failed.
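A minimal sketch of such a sweep might look like the following, where run_trial is a hypothetical stand-in for one complete training run that returns the final test accuracy (passing None would mean "train without seeding"):

```python
import statistics

def stability_sweep(run_trial, seeds):
    """Run one full training per seed and summarize the final test accuracies.

    run_trial is a hypothetical callable: given a seed (or None for an
    unseeded run), it trains the model and returns the final test accuracy.
    """
    accs = [run_trial(seed) for seed in seeds]
    return {
        "mean": statistics.mean(accs),
        "stdev": statistics.pstdev(accs),
        "min": min(accs),
        "max": max(accs),
    }
```

Comparing, say, stability_sweep(run_trial, range(1, 51)) against stability_sweep(run_trial, [None] * 50) would then give the two distributions (and failure rates) to contrast.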

So, I decided to rerun my model 50 times with a seed value (maybe that isn't enough) and 50 times without one. In the former, I chose seed values from 1 to 50, and on 47/50 of the runs the test accuracy ended up within 0.62-0.67. In the latter, on 47/50 of the runs the test accuracy was within 0.80-0.87. The outliers, in both cases, were either about 0.05 below these ranges or 0.10-0.20 above them (one of my non-seeded runs exceeded a test accuracy of 0.92).

If I need to rerun my model several more times, I can. Also, I don't know if this changes anything, but I am using the PyTorch Geometric IMDB dataset.

This doesn’t sound right. Are you only using torch.manual_seed, or are you using the set_seed method, which also disables cuDNN etc.?

I’m only using the functions I’ve listed above, which don’t include set_seed. I tried reducing my model from 28 million to 3 million parameters to see whether the ratio of model size to dataset size might matter, but I’m seeing the same results as I’ve shared.

OK, this is getting interesting, as you claim to see a difference in convergence depending on whether you call torch.manual_seed or not.
Would it be possible for us to reproduce the issue, i.e. could you share executable code?

Sorry for the delayed response. I have been testing my code on different datasets to see if there are any patterns, and I seem to be getting a 10-20% increase in accuracy each time. One thing I realized is that, by not setting the seed explicitly, each process/GPU might end up with a different seed value (though I'm not sure why the torch.manual_seed call would be the one most impacted). Looking at the DDP documentation, it says that upon construction, the initial model parameters from rank 0 are broadcast to the remaining ranks (so each replica starts off the same). Is this still accurate (I see a note about v1.4)?

Also, what is the default value for torch.manual_seed if an explicit seed isn’t provided?

Thanks again in advance!

I’m not sure there is a “default” manual seed across different processes if not explicitly specified, because as you observed, their results differ when it is not set explicitly. When reproducing results across different processes/runs you would need manual_seed for RNG operations at a minimum—please also check the reproducibility docs to see if any caveats discussed there are applicable to your use-case: Reproducibility — PyTorch 1.13 documentation
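The behavior described above can be illustrated with a stdlib analogy (an assumption on my part: random.Random stands in here for torch's RNG, which is likewise seeded from entropy when no seed is given):

```python
import random

# Two generators given the same explicit seed produce identical streams,
# which is what makes runs reproducible...
a = random.Random(42)
b = random.Random(42)
same = [a.random() for _ in range(5)] == [b.random() for _ in range(5)]

# ...while generators left unseeded are initialized from OS entropy, so
# two runs (or two processes) will almost certainly diverge.
c = random.Random()
d = random.Random()
diverge = [c.random() for _ in range(5)] != [d.random() for _ in range(5)]
```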