I am using DataParallel to run my program on several GPUs, and I use register_buffer to define buffers for my modules, like this:
class A(nn.Module):
    def __init__(self):
        ...
        self.register_buffer('foo', torch.empty_like(self.weight, dtype=torch.long))

    def forward(self):
        ...
        print('a:', self.foo)
        self.foo = torch.tensor(2018)
        print('b:', self.foo)
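On a single, unwrapped module the rebinding itself does work, which suggests the issue is specific to DataParallel (forward runs on per-GPU replicas, so assignments made there do not reach the original module). Here is a minimal CPU-only sketch of that baseline behavior; the dummy `weight` parameter is my own addition so the snippet is self-contained:

```python
import torch
import torch.nn as nn

class A(nn.Module):
    def __init__(self):
        super().__init__()
        # stand-in for the real weight; only needed so empty_like has a template
        self.weight = nn.Parameter(torch.zeros(1))
        # empty_like leaves the memory uninitialized, which is why the buffer
        # initially prints arbitrary large integers
        self.register_buffer('foo', torch.empty_like(self.weight, dtype=torch.long))

    def forward(self):
        # assigning to an attribute registered as a buffer rebinds the buffer
        self.foo = torch.tensor(2018)
        return self.foo

m = A()
m()
print(dict(m.named_buffers())['foo'])  # tensor(2018) on a plain, non-parallel module
```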
For the parallel part, I use:
device = torch.device('cuda:0')
model = ...
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
model = model.to(device)
I get the following output, and clearly the user-defined buffer "foo" is not assigned correctly:
a: tensor([140174715505560], device='cuda:3')
b: tensor(2018)
a: tensor([140174715505560], device='cuda:2')
b: tensor(2018)
tensor([140174715505560], device='cuda:0')
b: tensor(2018)
a: a: tensor([140174715505560], device='cuda:1')
b: tensor(2018)
a: tensor([140174715505528], device='cuda:3')
b: tensor(2018)
tensor([140174715505528], device='cuda:2')
b: tensor(2018)
tensor([140174715505528], device='cuda:0')
b: tensor(2018)
I am using an NVIDIA 2080 with CUDA 9.2. There are some other errors reported, but they do not stop the program from running. They look like this:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
Also, I did not use the following part (I commented out these two lines):

import torch.backends.cudnn as cudnn
cudnn.benchmark = True

because otherwise there are other errors and the program cannot run.