Different initialization on 2 GPUs

Hello,
First, I didn’t find an appropriate category, so I’ll ask my question here; I’m sorry if it’s not the right place. My problem is the following:

I have two GPUs (a TitanXp and a 2080 Ti) and I get different results when training the same model, even after applying all the tips here. I pinpointed the issue to the initialization of the network (using the xavier_uniform_ function), and to be exact to the call to Tensor.uniform_. The strange part is that across the whole network (85 modules), part of each module’s weight tensor matches between the two GPUs: the first 50 or so rows (if I remember correctly) of 1024 values are identical, and from there to the end of the tensor the values are completely different. Do you have any clue why that happens? Furthermore, do you have any idea/“trick” to get the same initialization on both machines?
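For reference, here is a minimal snippet of the kind of check I ran (the 1024 width matches my layers; the shapes are otherwise just for illustration):

```python
import torch

torch.manual_seed(0)

# Allocate directly on the GPU, so xavier_uniform_ (which calls
# Tensor.uniform_ internally) draws from that device's CUDA generator.
w = torch.empty(1024, 1024, device="cuda")
torch.nn.init.xavier_uniform_(w)

# Printing/saving this tensor on both machines is how I compared them:
# an initial block of rows matches, the rest does not.
print(w[:3, :5])
```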

Thanks in advance.
George

Deterministic results are not guaranteed between different hardware releases.
To get the same initialization, you could initialize the model on one device, save the state_dict, and load it on the other.
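A rough sketch of that workflow (the Linear layer and the file name are just stand-ins for your model):

```python
import torch

# Stand-in for your actual network; any nn.Module works the same way.
model = torch.nn.Linear(1024, 1024)

# On the first machine: save the freshly initialized weights.
torch.save(model.state_dict(), "init_state.pth")

# On the second machine: build the same architecture and load them.
model = torch.nn.Linear(1024, 1024)
model.load_state_dict(torch.load("init_state.pth", map_location="cpu"))
model.to("cuda")
```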

Thanks for your answer, ptrblck! I thought of that, but I am not keen on the idea of loading the state_dict on the other machine; I would rather go with a more programmatic approach/solution. On the other hand, I want reproducible results, and the model is quite sensitive to its initialization; that’s why I asked for a “trick”.

Have you ever encountered something like that? My guess is that the behavior I reported above is due to the different number of cores on each GPU, but I can’t get my head around it or think of a solution.
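One programmatic workaround I’m considering (just a sketch; I haven’t verified it is bit-identical across machines) is to seed and initialize on the CPU, whose generator shouldn’t depend on the GPU model, and only move the model to the GPU afterwards:

```python
import torch

torch.manual_seed(0)

# Build and initialize on the CPU, so Tensor.uniform_ uses the CPU
# generator, which should not depend on which GPU is installed.
model = torch.nn.Linear(1024, 1024)
torch.nn.init.xavier_uniform_(model.weight)

# Move to the GPU only after initialization; .to() copies the values as-is.
model.to("cuda")
```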

George