How can I completely fix the random seed?

Hi,

I added these lines at the beginning of my code; my main.py looks like this:

import torch
import torch.nn as nn
import numpy as np
import random
import my_model
import my_dataset

torch.manual_seed(123)
torch.cuda.manual_seed(123)
np.random.seed(123)
random.seed(123)
torch.backends.cudnn.enabled = False
torch.backends.cudnn.deterministic = True

def train():
...

if __name__ == '__main__':
    train()

I ran my program twice without modifying the code. The loss of the first iteration is 9.044713973999023 in both runs, but the loss of the second iteration differs: one run gives 9.045238494873047 and the other 9.045231819152832.

The reason I am so concerned about this random behavior is that the gap between the evaluation results of two runs of the same code can be as large as around 0.5%. I trained my model on Cityscapes for 80k iterations, which I believe should be enough for the model to converge. So I do not quite understand why there is such a gap between two runs, and why the loss already differs at the second iteration. How can I fix the random behavior so that the two results are identical, as expected?

That should generally be sufficient. A few operations have non-deterministic gradient calculations. For example, the gradient calculation of index_select() with duplicate indices can result in non-determinism due to the ordering of floating point additions.

The description of the operators with non-deterministic behavior is here:
https://pytorch.org/docs/stable/notes/randomness.html#pytorch
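
As an illustration (my own toy example, not from the docs), something like the following can show the effect on the GPU; whether the mismatch actually shows up depends on your hardware and PyTorch version:

import torch

x = torch.randn(10, 16, device='cuda', requires_grad=True)
# many duplicate indices, so the backward has to accumulate several
# gradient rows into the same rows of x.grad
idx = torch.randint(0, 10, (100000,), device='cuda')
w = torch.randn(100000, 16, device='cuda')

grads = []
for _ in range(2):
    x.grad = None
    out = torch.index_select(x, 0, idx)
    (out * w).sum().backward()
    grads.append(x.grad.clone())

# the accumulation uses atomicAdd, so the order of the floating point
# additions is not fixed and the two gradients can differ in the last bits
print(torch.equal(grads[0], grads[1]))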

Try to narrow down where the non-determinism is introduced. Use backward hooks to print out where in the network the gradient starts to vary between runs.
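
A rough sketch of such a hook (the tiny nn.Sequential is just a stand-in for your own network, and the printed statistic is only one possible fingerprint):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                      nn.ReLU(),
                      nn.Conv2d(8, 8, 3, padding=1)).cuda()

def make_hook(name):
    def hook(module, grad_input, grad_output):
        g = grad_output[0]
        if g is not None:
            # a cheap fingerprint of the gradient w.r.t. this module's output;
            # compare the printed values between two runs to see where they diverge
            print(name, float(g.double().abs().sum()))
    return hook

for name, module in model.named_modules():
    if len(list(module.children())) == 0:   # only hook leaf modules
        module.register_backward_hook(make_hook(name))

out = model(torch.randn(2, 3, 32, 32, device='cuda'))
out.sum().backward()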


Thanks for replying!!

I just do not understand: can this randomness really cause a performance gap as large as 0.5% when I run identical code twice?

Hi,

In the code above, did I place the torch.manual_seed ... lines in the correct location? I use torch.distributed for multi-GPU training. Do I need to move these lines after dist.init_process_group, or somewhere else?

Seeding before dist.init_process_group should be fine, but if you're using torch.multiprocessing.spawn or similar, you need to make sure that the seeding happens after you spawn.
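
For example (a sketch only; the backend, init_method, and worker body are placeholders for your own setup):

import random
import numpy as np
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def worker(rank, world_size):
    # seed inside each spawned process, before the model and data are built
    set_seed(123)
    dist.init_process_group('nccl', init_method='tcp://127.0.0.1:23456',
                            rank=rank, world_size=world_size)
    # ... build the model, wrap it in DistributedDataParallel, train ...

if __name__ == '__main__':
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)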

Thanks a lot for suggesting hooks!!

I tried using hooks to print the mean of the gradient tensors, and I found that there is an F.interpolate in my code that produces different gradients each time. Is there any way to eliminate this randomness? With it, the results fluctuate so much that I cannot tune the hyperparameters of my model.

I am also confused by the many seeding methods; there are these seeding methods too…

torch.random.initial_seed()  
torch.cuda.manual_seed_all(seed_value)
torch.cuda.set_rng_state(cuda_rng_state)
torch.set_rng_state(rng_state)
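
From the docs, my rough understanding is that the first two are about setting a seed (for the default CPU generator and for all visible GPUs), while the set_rng_state calls restore a complete generator state that was captured earlier, for example:

import torch

torch.manual_seed(123)              # seed the default CPU generator
torch.cuda.manual_seed_all(123)     # seed the generator of every visible GPU
print(torch.random.initial_seed())  # 123, the seed the default generator was given

# capture the exact generator states (e.g. to resume training from a checkpoint)
rng_state = torch.get_rng_state()
cuda_rng_state = torch.cuda.get_rng_state()

# ... consume some randomness ...
_ = torch.randn(3)

# ... and restore the states so the following numbers repeat exactly
torch.set_rng_state(rng_state)
torch.cuda.set_rng_state(cuda_rng_state)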

Hello,

It seems that the non-determinism of interpolate is a common issue.
Here is a thread on this forum about interpolate; it might give you more information.
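
If you want to confirm that interpolate really is the source, a small isolated check along these lines might help (the sizes are made up; a non-zero difference means the backward is non-deterministic on your setup):

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 128, device='cuda', requires_grad=True)
w = torch.randn(1, 3, 512, 1024, device='cuda')

grads = []
for _ in range(2):
    x.grad = None
    out = F.interpolate(x, size=(512, 1024), mode='bilinear', align_corners=False)
    (out * w).sum().backward()
    grads.append(x.grad.clone())

# the upsampling backward accumulates with atomicAdd on the GPU,
# so the two gradients may not match bit for bit
print((grads[0] - grads[1]).abs().max().item())

If it does turn out to be interpolate, one workaround people sometimes use is to run just that resize (and therefore its backward) on the CPU, at the cost of speed.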


@colesbury @MariosOreo @Deeply
Hi,

I have run into another problem that I suspect is associated with random behavior. I am training a ResNet-18 on the CIFAR-10 dataset. The model is simple and standard, with only conv2d, bn, relu, avg_pool2d, and linear operators. There still seem to be randomness problems, even though I have set the seeds of the random generators. Does avg_pool2d have the same non-determinism problem as the interpolation?

Let me know and I will paste my code here if you think the problem is in my code. :slight_smile:

According to the doc,

There are some PyTorch functions that use CUDA functions that can be a source of non-determinism. One class of such CUDA functions are atomic operations, in particular atomicAdd, where the order of parallel additions to the same value is undetermined and, for floating-point variables, a source of variance in the result.

And a number of operations have backwards that use atomicAdd, such as many forms of pooling, padding, and sampling. There is currently no simple way of avoiding non-determinism in these functions.
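
If you can use a newer PyTorch (1.8 or later, so this may not apply to the version you are running), torch.use_deterministic_algorithms(True) turns such silently non-deterministic operations into hard errors, which makes them easy to find; a sketch:

import os
import torch

# the PyTorch reproducibility notes ask for this cuBLAS setting on CUDA >= 10.2
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

torch.manual_seed(123)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True)   # PyTorch >= 1.8

# from here on, any CUDA op whose implementation is known to be
# non-deterministic and has no deterministic alternative raises a
# RuntimeError naming the operator, instead of silently varying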