Reproducing results across different machines (CPU) - Dropout layers

I was trying to reproduce the results of a CNN on two different machines (both running on CPU), but I ended up with different results, despite setting the same Python and torch seeds in advance. After checking the CNN layers individually, I noticed that the difference originates in the Dropout layer specifically. I couldn’t find the exact implementation of Dropout, so I was wondering if anyone has an idea of what may be behind this?

Some additional details:

import random
import torch
random.seed(2)
torch.manual_seed(3)
a = torch.randn(4, 4)
dp1 = torch.nn.Dropout(p=0.3)
dp1(a)
# Result in one of the machines:
# tensor([[-1.0361, -3.5810, -0.0000,  1.8477],
#         [-2.0117, -0.0000, -0.0785,  1.7217],
#         [-0.0000, -2.3690, -0.0955, -0.7392],
#         [-0.0000,  1.5011,  1.8367,  2.3154]])
# Result for the other machine:
# tensor([[-1.0361, -0.0000, -1.1379,  1.8477],
#         [-2.0117, -1.2687, -0.0000,  1.7217],
#         [-1.4952, -2.3690, -0.0000, -0.7392],
#         [-0.4844,  1.5011,  1.8367,  2.3154]])
  • I also tried the tips from the reproducibility page (e.g. using deterministic algorithms). However, the results are still different.
  • I used a Docker image, so library versions are exactly the same (Python 3.9 / torchvision 0.11.1 / torch 1.10.0).

Any tips/ideas are welcome, thanks.

As described in the reproducibility docs, there is no guarantee of getting exactly the same random values across different platforms, CPUs, GPUs, etc.
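For intuition, standard (inverted) dropout just draws a Bernoulli mask from the random number generator and rescales the surviving elements, so any platform difference in how random values are produced or consumed shows up directly in which elements are zeroed. This is a pure-Python sketch of the algorithm, not PyTorch's actual implementation (which lives in C++/ATen and uses torch's global RNG):

```python
import random

def inverted_dropout(x, p, seed):
    # Hypothetical sketch: zero each element with probability p and
    # scale survivors by 1/(1-p) so the expected value is unchanged.
    rng = random.Random(seed)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in x]

out = inverted_dropout([1.0, 2.0, 3.0, 4.0], p=0.3, seed=3)
```

Two machines that seed the RNG identically but sample from it differently (different RNG implementation, different sampling order) will produce different masks even on identical inputs.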

Indeed, I thought there could be a specific reason, given that it happened only in one specific layer. Thanks.
(In the meantime, I tested it on an AWS machine with an ARM architecture and was able to get the same results. I didn’t use AWS for my original experiments, and my Docker image was built for x86, so I still have to understand what happened, but in any case I hope this may help someone.)