Using register_buffer with DataParallel and CUDA

I am using DataParallel to run the program on several GPUs, and register_buffer to define some variables for my modules, like this:

class A(nn.Module):
    def __init__(self):
        ...
        self.register_buffer('foo', torch.empty_like(self.weight, dtype=torch.long))

    def forward(self, x):
        ...
        print('a: ', self.foo)
        self.foo = torch.tensor(2018)  # direct assignment to the registered buffer
        print('b: ', self.foo)

For the parallelism, I use

device = torch.device('cuda:0')
model = ...
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
model = model.to(device)

I get the following output, and clearly the user-defined buffer "foo" is not assigned correctly:

a: tensor([140174715505560], device='cuda:3')
b: tensor(2018)
a: tensor([140174715505560], device='cuda:2')
b: tensor(2018)
tensor([140174715505560], device='cuda:0')
b: tensor(2018)
a: a: tensor([140174715505560], device='cuda:1')
b: tensor(2018)
a: tensor([140174715505528], device='cuda:3')
b: tensor(2018)
tensor([140174715505528], device='cuda:2')
b: tensor(2018)
tensor([140174715505528], device='cuda:0')
b: tensor(2018)

I am using NVIDIA 2080 GPUs and CUDA 9.2. Some other errors are reported, but they do not stop the program from running. The errors look like this:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument

Also, I did not use the following part (I commented out these two lines)

import torch.backends.cudnn as cudnn
cudnn.benchmark = True

because otherwise there are other errors and the program cannot run.

Could you try to assign the device to your new tensor using the device of another parameter?
E.g. if your A module has some layers or other parameters stored, try the following:

print(self.foo)
self.foo = torch.tensor(2018, device=self.another_parameter.device)
print(self.foo)

Thanks for your reply. I tried it now, but the result is the same.

Here is the modified part:

self.foo = torch.tensor(2018, device=self.weight.device)

Because A is a layer, I use self.weight.

It seems to me that inside the module's methods the buffer can be assigned correctly, but afterwards it is reinitialized, or maybe it is not modified at all by the method.

I also print the buffer variable outside the file that contains the class definition, as follows:

for m in model.modules():
    if isinstance(m, A):
        print('c: ', m.foo.item())

and it outputs the random large number instead of 2018.

With a single GPU, I do not need to use the buffer, and there is no such problem.

Thanks for the explanation.
In that case, could you try to use an in-place method like self.foo.fill_(2018)?


Thanks for your help. It doesn’t work either. :frowning_face:

Actually the printout is a little messy, and it seems that there are some conflicts among the four GPUs when printing values:

a: a: a: tensor([140339592887176], device='cuda:7')
tensor([140339592887176], device='cuda:6')tensor([140339592887176], device='cuda:5')a:

b: b: tensor([140339592887176], device='cuda:4')b: tensor([2018], device='cuda:7')

tensor([2018], device='cuda:5')tensor([2018], device='cuda:6')

b: tensor([2018], device='cuda:4')
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535493744281/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
a: tensor([140339592887176], device='cuda:4')a:
a: a: b: tensor([140339592887176], device='cuda:6')
tensor([2018], device='cuda:4')
b: tensor([2018], device='cuda:6')
tensor([140339592887176], device='cuda:5')tensor([140339592887176], device='cuda:7')

b: b: tensor([2018], device='cuda:5')tensor([2018], device='cuda:7')

a: tensor([140339592887176], device='cuda:4')
a: b: tensor([140339592887176], device='cuda:6')

I can only explain this by different layers being assigned to different GPUs, or maybe by the model being copied multiple times to different GPUs. I do not quite understand parallel computing.

The output looks fine. Since CUDA calls are asynchronous, you won’t necessarily see a specific order of the print statements.
However, it looks like self.foo is now on the different devices.
Are you getting any other error messages now?

Thanks. Actually no, the only error messages are the three posted above, and the program runs to completion.

I was trying to understand how to use register_buffer by checking the code of batch_norm, but it uses the function torch.batch_norm, which does not seem to be a Python function but something implemented in C/C++.
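
From that reading, the pattern those built-in modules seem to follow looks roughly like this toy sketch (not the actual BatchNorm code; the momentum value and shapes are made up): the running stats live in a registered buffer and are only ever updated in place inside forward.

import torch
import torch.nn as nn

class ToyNorm(nn.Module):
    # Toy module mirroring the BatchNorm pattern: running stats are registered
    # buffers and are updated in place (never reassigned) during forward.
    def __init__(self, num_features):
        super().__init__()
        self.register_buffer('running_mean', torch.zeros(num_features))

    def forward(self, x):
        with torch.no_grad():
            self.running_mean.mul_(0.9).add_(x.mean(dim=0), alpha=0.1)
        return x - self.running_mean

m = ToyNorm(4)
m(torch.randn(8, 4))
print(m.running_mean)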

I tried a very tiny toy example, and the output seems to be correct; there is no such problem in the tiny example. Since there are around 5 files even in this tiny example, I cannot copy all of them here.

Thanks very much. I will try to debug a little.

Thank you very much. The problem is solved. The most important thing is that I can only use self.foo.fill_(some_tensor.item()) and must avoid ANY use of self.foo = some_tensor. Even if I call self.foo.fill_(value) after self.foo = some_tensor, the result is incorrect. I don’t know why. Anyway, thank you very much.
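
In case it helps someone else, here is roughly the shape of the working version (a sketch only; the layer size and the recorded value are placeholders for my real code):

import torch
import torch.nn as nn

class A(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(4, 4))
        self.register_buffer('foo', torch.empty_like(self.weight, dtype=torch.long))

    def forward(self, x):
        some_tensor = torch.tensor(2018)     # placeholder for the value I actually record
        self.foo.fill_(some_tensor.item())   # in-place update of the registered buffer: works
        # self.foo = some_tensor             # rebinding the attribute: silently lost under DataParallel
        return x @ self.weight.t()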


Good to hear it’s working now!
I still think it should also work if you specify the device during tensor creation.

Thanks. The following does not work:

self.foo = self.weight.new(2018).long()

Hi, I am sorry to bother you again, but after a long simulation the results seem to be incorrect. Do you have any idea of the reason? I think there should be no other bugs, as the program is simple and the result is correct when I use a single GPU. Thanks again.

For a single GPU, I comment out the following line:

model = nn.DataParallel(model, device_ids=...)

The way DataParallel works is that at each iteration, it creates a replica of the input model on each device. So each device has its own replica of the model, a different Python object than the input model. So your original approach of directly assigning doesn’t work, as it is assigning an attribute on a replica, which is freed after the iteration.

The weights & buffers of these replicas are linked to the original weights & buffers in the input model by the autograd graph, so backward still works. Moreover, the weights & buffers of the replica on device[0] share storage with those of the input model! So the in-place version “works” in the sense that changes done on the first replica (on device[0]) will propagate through. However, changes from other replicas are lost.
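
Here is a small sketch illustrating that behavior (it assumes at least two GPUs; the module and its buffer are toy placeholders, not your code):

import torch
import torch.nn as nn

class Tracker(nn.Module):
    # Toy module: records in a buffer the device index of the last chunk it saw.
    def __init__(self):
        super().__init__()
        self.register_buffer('last_device', torch.zeros(1, dtype=torch.long))

    def forward(self, x):
        self.last_device.fill_(x.device.index)  # in-place update on this replica's buffer
        return x

model = nn.DataParallel(Tracker().to('cuda:0'), device_ids=[0, 1])
model(torch.randn(4, 1, device='cuda:0'))

# Only the replica on device[0] shares storage with the original buffer,
# so the update made on cuda:1 is lost and this prints tensor([0]).
print(model.module.last_device)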


Thank you very much for the explanation. I partly understand what you said now. So if I want to, say, check some self-defined status like "foo" of the weights of layers in the model while using multiple GPUs (as I need to run experiments on a very large dataset), is there any way to achieve what I want? Thank you again.

It seems that if I specify only one GPU, even though I use nn.DataParallel, for example as follows,

model = nn.DataParallel(model, device_ids=[0])

the result will be correct. But if I use more than one, say

model = nn.DataParallel(model, device_ids=[0, 1])

the result will make no sense.

Hi,
did you manage to solve this problem? I am facing something similar.

I think DistributedDataParallel does not have this problem. I have to say I totally forgot how I solved this. Sorry about that; I should have written down the final answer back then.
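
For anyone who finds this later, here is a rough sketch of a DistributedDataParallel setup (single node, one process per GPU; the module, address/port, and tensor sizes below are placeholders, not the original code). Since each process owns its own copy of the module, in-place buffer updates inside forward behave like the single-GPU case:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class A(nn.Module):
    # stand-in for the module with the registered buffer discussed above
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(4, 4))
        self.register_buffer('foo', torch.zeros(1, dtype=torch.long))

    def forward(self, x):
        self.foo.fill_(2018)  # each process owns its module, so the in-place update sticks
        return x @ self.weight.t()

def worker(rank, world_size):
    dist.init_process_group('nccl', init_method='tcp://127.0.0.1:23456',
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(A().to(rank), device_ids=[rank])
    model(torch.randn(8, 4, device=rank))
    print(rank, model.module.foo)  # tensor([2018]) on its own device in every process
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)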