Using register_buffer with DataParallel and CUDA

I am using DataParallel to run the program on several GPUs, and register_buffer to define some variables for my modules, like this:

class A(nn.Module):
    def __init__(self):
        ...
        self.register_buffer('foo', torch.empty_like(self.weight, dtype=torch.long))

    def forward(self, x):
        ...
        print('a: ', self.foo)
        self.foo = torch.tensor(2018)  # direct assignment to the registered buffer
        print('b: ', self.foo)

For the parallelism, I use

device = torch.device('cuda:0')
model = ...
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
model = model.to(device)

I get the following output, and clearly the user-defined buffer "foo" is not assigned correctly:

a: tensor([140174715505560], device='cuda:3')
b: tensor(2018)
a: tensor([140174715505560], device='cuda:2')
b: tensor(2018)
tensor([140174715505560], device='cuda:0')
b: tensor(2018)
a: a: tensor([140174715505560], device='cuda:1')
b: tensor(2018)
a: tensor([140174715505528], device='cuda:3')
b: tensor(2018)
tensor([140174715505528], device='cuda:2')
b: tensor(2018)
tensor([140174715505528], device='cuda:0')
b: tensor(2018)

I am using NVIDIA 2080 GPUs and CUDA 9.2. Some other errors are reported, but they do not stop the program from running. The errors look like this:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument

Also, I did not use the following part (I commented out these two lines)

import torch.backends.cudnn as cudnn
cudnn.benchmark = True

because otherwise there are other errors and the program cannot run.

Could you try to assign the device to your new tensor using the device of another parameter?
E.g. if your A module has some layers or other parameters stored, try the following:

print(self.foo)
self.foo = torch.tensor(2018, device=self.another_parameter.device)
print(self.foo)

Thanks for your reply. I tried it now, but the result is the same.

Here is the modified part:

self.foo = torch.tensor(2018, device=self.weight.device)

Because A is a layer, I use self.weight.

It seems to me that inside the module's methods the buffer can be assigned correctly, but afterwards it is reinitialized, or maybe it is not modified at all by the method.

I also print the buffer variable outside the file that contains the class definition, as follows:

for m in model.modules():
    if isinstance(m, A):
        print('c: ', m.foo.item())

and it outputs the random large number instead of 2018.

With a single GPU, I do not need to use the buffer, and there is no such problem.

Thanks for the explanation.
In that case, could you try to use an in-place method like self.foo.fill_(2018)?


Thanks for your help. It doesn’t work either. :frowning_face:

Actually the printout is a little messy, and it seems that there are some conflicts among the four GPUs when printing values:

a: a: a: tensor([140339592887176], device='cuda:7')
tensor([140339592887176], device='cuda:6')tensor([140339592887176], device='cuda:5')a:

b: b: tensor([140339592887176], device='cuda:4')b: tensor([2018], device='cuda:7')

tensor([2018], device='cuda:5')tensor([2018], device='cuda:6')

b: tensor([2018], device='cuda:4')
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
a: tensor([140339592887176], device='cuda:4')a:
a: a: b: tensor([140339592887176], device='cuda:6')
tensor([2018], device='cuda:4')
b: tensor([2018], device='cuda:6')
tensor([140339592887176], device='cuda:5')tensor([140339592887176], device='cuda:7')

b: b: tensor([2018], device='cuda:5')tensor([2018], device='cuda:7')

a: tensor([140339592887176], device='cuda:4')
a: b: tensor([140339592887176], device='cuda:6')

I can only explain this by different layers being assigned to different GPUs, or maybe by the model being copied multiple times to different GPUs. I do not quite understand parallel computing.

The output looks fine. Since CUDA calls are asynchronous, you won’t necessarily see a specific order of the print statements.
However, it looks like self.foo is now on the different devices.
Are you getting any other error messages now?

Thanks. Actually no, the only error messages are the three posted above, and the program runs to completion.

I was trying to understand how to use register_buffer by checking the code of batch_norm, but it uses the function torch.batch_norm, which does not seem to be a Python function but something implemented in C/C++.
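
From that reading, the pattern those built-in modules seem to follow looks roughly like this toy sketch (not the actual BatchNorm code; the momentum value and shapes are made up): the running stats live in a registered buffer and are only ever updated in place inside forward.

import torch
import torch.nn as nn

class ToyNorm(nn.Module):
    # Toy module mirroring the BatchNorm pattern: running stats are registered
    # buffers and are updated in place (never reassigned) during forward.
    def __init__(self, num_features):
        super().__init__()
        self.register_buffer('running_mean', torch.zeros(num_features))

    def forward(self, x):
        with torch.no_grad():
            self.running_mean.mul_(0.9).add_(x.mean(dim=0), alpha=0.1)
        return x - self.running_mean

m = ToyNorm(4)
m(torch.randn(8, 4))
print(m.running_mean)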

I tried a very tiny toy example, and the output seems to be correct; there is no such problem in the tiny example. Since there are around 5 files even in this tiny example, I cannot copy all of them here.

Thanks very much. I will try to debug a little.

Thank you very much. The problem is solved. The most important thing is that I can only use self.foo.fill_(some_tensor.item()) and must avoid ANY use of self.foo = some_tensor. Even if I call self.foo.fill_(value) after self.foo = some_tensor, the result is incorrect. I don’t know why. Anyway, thank you very much.
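
In case it helps someone else, here is roughly the shape of the working version (a sketch only; the layer size and the recorded value are placeholders for my real code):

import torch
import torch.nn as nn

class A(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(4, 4))
        self.register_buffer('foo', torch.empty_like(self.weight, dtype=torch.long))

    def forward(self, x):
        some_tensor = torch.tensor(2018)     # placeholder for the value I actually record
        self.foo.fill_(some_tensor.item())   # in-place update of the registered buffer: works
        # self.foo = some_tensor             # rebinding the attribute: silently lost under DataParallel
        return x @ self.weight.t()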


Good to hear it’s working now!
I still think it should also work if you specify the device during tensor creation.

Thanks. The following does not work:

self.foo = self.weight.new(2018).long()

Hi, I am sorry to bother you again, but after a long simulation the results seem to be incorrect. Do you have any idea of the reason? I think there should be no other bugs, as the program is simple and the result is correct when I use a single GPU. Thanks again.

For a single GPU, I comment out the following line:

model = nn.DataParallel(model, device_ids=...)

The way DataParallel works is that at each iteration, it creates a replica of the input model on each device. So each device has its own replica of the model, a different Python object than the input model. So your original approach of directly assigning doesn’t work, as it is assigning an attribute on a replica, which is freed after the iteration.

The weights & buffers of these replicas are linked to the original weights & buffers in the input model by the autograd graph, so backward still works. Moreover, the weights & buffers of the replica on device[0] share storage with those of the input model! So the in-place version “works” in the sense that changes done on the first replica (on device[0]) will propagate through. However, changes from other replicas are lost.
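
Here is a small sketch illustrating that behavior (it assumes at least two GPUs; the module and its buffer are toy placeholders, not your code):

import torch
import torch.nn as nn

class Tracker(nn.Module):
    # Toy module: records in a buffer the device index of the last chunk it saw.
    def __init__(self):
        super().__init__()
        self.register_buffer('last_device', torch.zeros(1, dtype=torch.long))

    def forward(self, x):
        self.last_device.fill_(x.device.index)  # in-place update on this replica's buffer
        return x

model = nn.DataParallel(Tracker().to('cuda:0'), device_ids=[0, 1])
model(torch.randn(4, 1, device='cuda:0'))

# Only the replica on device[0] shares storage with the original buffer,
# so the update made on cuda:1 is lost and this prints tensor([0]).
print(model.module.last_device)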


Thank you very much for the explanation. I partly understand what you said now. So if I want to, say, check some self-defined status like "foo" of the weights of layers in the model while using multiple GPUs (as I need to run experiments on a very large dataset), is there any way to achieve what I want? Thank you again.

It seems that if I specify only one GPU, even though I use nn.DataParallel, for example as follows,

model = nn.DataParallel(model, device_ids=[0])

the result will be correct. But if I use more than one, say

model = nn.DataParallel(model, device_ids=[0, 1])

the result will make no sense.

Hi,
did you manage to solve this problem? I am facing something similar.

I think DistributedDataParallel does not have this problem. I have to say I totally forgot how I solved this. Sorry about that; I should have written down the final answer back then.
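
For anyone who finds this later, here is a rough sketch of a DistributedDataParallel setup (single node, one process per GPU; the module, address/port, and tensor sizes below are placeholders, not the original code). Since each process owns its own copy of the module, in-place buffer updates inside forward behave like the single-GPU case:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class A(nn.Module):
    # stand-in for the module with the registered buffer discussed above
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(4, 4))
        self.register_buffer('foo', torch.zeros(1, dtype=torch.long))

    def forward(self, x):
        self.foo.fill_(2018)  # each process owns its module, so the in-place update sticks
        return x @ self.weight.t()

def worker(rank, world_size):
    dist.init_process_group('nccl', init_method='tcp://127.0.0.1:23456',
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(A().to(rank), device_ids=[rank])
    model(torch.randn(8, 4, device=rank))
    print(rank, model.module.foo)  # tensor([2018]) on its own device in every process
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)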