I am facing an issue similar to the one in the Stack Overflow question "Pytorch incorrect value of member variable when using Multi-gpu".
A minimal example of this issue is the following:
## debug.py

```python
import torch
import torch.nn as nn

class Conv2d(nn.Conv2d):
    def forward(self, input):
        # Record an attribute sized to the (per-device) batch on every call
        self.foo = torch.ones(input.shape[0])
        print('A: ', self.foo.size())
        return super().forward(input)

def main():
    m = Conv2d(4, 3, 2)
    m = nn.DataParallel(m).cuda()
    # This call is necessary for m.module to have 'foo' as a member
    m.module(torch.ones(1, 4, 6, 6).cuda())
    for bs in range(2, 5):
        m(torch.ones(bs, 4, 6, 6).cuda())
        print('B: ', m.module.foo.size())

if __name__ == '__main__':
    main()
```
The value printed at 'A' is correct every time the forward pass runs. However, if I read the attribute from outside the module, it is only correct when I use a single GPU. For example, running the above code on a single GPU gives
```
A: torch.Size([1])
A: torch.Size([2])
B: torch.Size([2])
A: torch.Size([3])
B: torch.Size([3])
A: torch.Size([4])
B: torch.Size([4])
```
whereas running it on multiple GPUs gives a different, incorrect result such as
```
A: torch.Size([1])
A: torch.Size([1])
A: torch.Size([1])
B: torch.Size([1])
A: torch.Size([2])
A: torch.Size([1])
B: torch.Size([1])
A: torch.Size([2])
A: torch.Size([2])
B: torch.Size([1])
```
I think this is related to some synchronization problem, but I would like to know the detailed reason for this behavior, and the correct way to set member variables on the module.
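For what it is worth, my current guess is that DataParallel replicates the module onto each device on every forward call, so foo gets set on the replicas rather than on m.module. If that is right, one workaround I can think of is to return foo from forward instead of storing it on self, so that DataParallel gathers it across devices along with the output. A minimal sketch of that idea (the two-element return value is my own change, not part of the original code):

```python
import torch
import torch.nn as nn

class Conv2d(nn.Conv2d):
    def forward(self, input):
        # Compute foo on the replica's device and return it instead of
        # storing it on self, so DataParallel can gather it along dim 0
        foo = torch.ones(input.shape[0], device=input.device)
        return super().forward(input), foo

def main():
    m = nn.DataParallel(Conv2d(4, 3, 2)).cuda()
    for bs in range(2, 5):
        out, foo = m(torch.ones(bs, 4, 6, 6).cuda())
        print('B: ', foo.size())  # hoping for torch.Size([bs]) on any GPU count

if __name__ == '__main__':
    main()
```

Is returning the value like this the recommended approach, or is there a better way to keep a per-batch member variable on the module?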
I am using Python 3.6.9, PyTorch 1.3.0, CUDA 10.0.130.
Thanks a lot.