Initializing a member tensor after creation with DataParallel (repost)

I have a member tensor that is created/saved during a backward hook. Hook looks like:

def saveGrad(self, grad_input, grad_output):
    self.currentGrad = grad_output[0].detach()

This works fine on one GPU, but when running on multiple GPUs I get: AttributeError: (module) object has no attribute 'currentGrad'. I imagine I could do a .copy_() or something, but I won't actually know the shape until the network has been run on an image.
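For reference, a minimal single-device sketch of how a hook like this might be wired up. The tiny Net, its layer sizes, and the input are all made up for illustration; only saveGrad mirrors the post above:

```python
import torch
import torch.nn as nn

# saveGrad as in the post; registering the plain function (not a bound
# method) means `self` here receives the module the hook fires on.
def saveGrad(self, grad_input, grad_output):
    self.currentGrad = grad_output[0].detach()

# Hypothetical tiny module, just to exercise the hook.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        self.register_backward_hook(saveGrad)

    def forward(self, x):
        return self.fc(x)

net = Net()
net(torch.randn(3, 4)).sum().backward()
print(net.currentGrad.shape)  # grad of the module output: torch.Size([3, 2])
```

On a single device this works because the same module object runs the forward, the hook, and any later member function.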

Hi,

Where do you get this error? Keep in mind that when running on multiple GPUs, a separate clone of the Module runs on each device, so they don't share self.
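A stdlib-only sketch of those replication semantics (ToyModule is a hypothetical stand-in; DataParallel's replicate step effectively gives each device its own copy of the module, much like a deepcopy):

```python
import copy

# Hypothetical stand-in for an nn.Module.
class ToyModule:
    def __init__(self):
        self.weight = [1.0, 2.0]

original = ToyModule()
replica = copy.deepcopy(original)  # roughly what each device gets

# A hook running on the replica sets an attribute only on that copy...
replica.currentGrad = [0.5, 0.5]

# ...so the original (and any other replica) never sees it.
print(hasattr(replica, "currentGrad"))   # True
print(hasattr(original, "currentGrad"))  # False
```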

Right, that is my understanding. But I don't know how to ensure something is accessible in both clones. I am getting the error in a member function:

    def memberFunction(self):
            self.otherFunction(self.currentGrad)

The register_buffer suggestion from Multi GPU backwards hook on wrong device fixed this error in all the places where I do know the tensor size ahead of time. But the question is essentially: how do I register_buffer so it works on multiple GPUs when the size depends on the input?

I am not sure I understand why not knowing the size is an issue. Can you just resize the Tensor before using it?

Possibly, how would I do that? I tried:
self.register_buffer('currentGrad', torch.zeros(1).to(gf.device).double())
in __init__() and then in the backward hook I do:

def saveGrad(self, grad_input, grad_output):
    if(len(self.currentGrad) == 1):
        self.currentGrad = self.currentGrad.repeat(grad_output[0].shape)
    self.currentGrad = grad_output[0].detach()

But that gives me an error that implies the modification was not saved:

ValueError: not enough values to unpack (expected 4, got 1)
Uncaught exception. Entering post mortem debugging

Been a few days, any thoughts?

The error seems to point to another place in the code no? You need to make sure not to unpack too many values.

Also for torch.zeros(1).to(gf.device).double(), you can replace it with torch.zeros(1, device=gf.device, dtype=torch.double).
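As a quick illustration of why the one-step form is preferable (using "cpu" here as a stand-in for whatever gf.device holds): the chained version first allocates a float32 tensor and then converts it, while the keyword version allocates the right dtype and device directly.

```python
import torch

# Chained form: allocates a float32 tensor, then moves/converts it.
a = torch.zeros(1).to("cpu").double()

# One-step form: allocates with the right dtype and device up front.
b = torch.zeros(1, device="cpu", dtype=torch.double)

print(a.dtype, b.dtype)  # torch.float64 torch.float64
```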


Ah, good to know on the second part.

But no, your suggestion above was to initialize it as a buffer so it gets picked up by DataParallel, and then resize it later before using it, correct? When I try that, the resize does not seem to get picked up by the registered buffer, and when it gets called, one (or both) of the copies still has the original size.

Just to put all the code in the same place:

def __init__(self):
    ...
    self.register_buffer('currentGrad', torch.zeros(1).to(gf.device).double())

def memberFunction(self):
    self.otherFunction(self.currentGrad)

def saveGrad(self, grad_input, grad_output):
    if(len(self.currentGrad) == 1):
        self.currentGrad = self.currentGrad.repeat(grad_output[0].shape)
    self.currentGrad = grad_output[0].detach()

What I had noticed in the earlier thread is that setting things with self.currentGrad = ... caused the problem where the attribute is not found, but in-place modifications like self.currentGrad += ... did work. Is there a way to do an in-place resize, maybe?

Sorry, this thread is quite long and I don't have it all in my head.

But the “self” objects that you get on different devices are different. So any change you make to one of them will not be reflected on the others, or on the original module.
But for a given Module, while running on its GPU, if you set a value in the forward, it will still be there in the backward.

Also as mentioned in the doc, register_backward_hook() is not working properly at the moment and should not be used.

I feel like that is not what @ptrblck was saying here: Multi GPU backwards hook on wrong device. It sounded like doing register_buffer allows you to define “self” variables that ensure that they work properly on multiple GPUs. Did I misread his post about that?

I feel like I need to use register_backward_hook. It is also working entirely correctly for me on one GPU. Is there a page or thread you can point me to for what to do instead?

It does, but it depends what you want to do.
Having it be a buffer means that the replication code will properly replicate it and move it to the right GPU for each particular replica, so you will be able to use it properly with the inputs on that replica.
But that does not mean that it will be shared between the replicas!


Ok, so I think that is essentially the core of my question: is it possible to resize a buffer after it has been declared? And additionally, is it possible to add a buffer after DataParallel has been called? Or, if I want to do these things, will I need to essentially do:

net = net.module  # get rid of DataParallel
net.registerNewBuffers()
net = DataParallel(net)  # redeclare as DataParallel

And as an extra question: if I have 10 buffers, do I need to give each one its own name, or can I make an array of buffers?

Is it possible to resize a buffer after it has been declared?

Yes, you can do .resize_(). Note that the content of the Tensor will be uninitialized though, so you need to make sure that you either zero it or write some value before using it.
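A minimal sketch of that pattern (on CPU for illustration): grow a one-element placeholder in place once the real shape is known, then fill it in place as well, so the object registered with register_buffer() is never replaced.

```python
import torch

buf = torch.zeros(1, dtype=torch.double)                  # placeholder buffer
grad = torch.arange(6, dtype=torch.double).reshape(2, 3)  # stand-in for grad_output[0]

buf.resize_(*grad.shape)  # in-place resize; contents are uninitialized here
buf.copy_(grad)           # in-place write, preserving the buffer object

print(buf.shape)  # torch.Size([2, 3])
```

The key point is that both resize_ and copy_ mutate the existing tensor rather than rebinding the attribute, which is what a plain assignment would do.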

And additionally is it possible to add a buffer after DataParallel has been called?

Maybe but I would not recommend that. It sounds quite fragile.

if I have 10 buffers do I need to have them all have a personal name or can I make an array of buffers?

No, you cannot make an array of buffers at the moment. You will have to register them one by one.
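One workaround sketch (the ManyBuffers class and its names are made up): since register_buffer() takes a single name, a "list of buffers" can be emulated by generating names in a loop and reading them back with getattr.

```python
import torch
import torch.nn as nn

class ManyBuffers(nn.Module):
    def __init__(self, n=10):
        super().__init__()
        # Register n buffers one by one under generated names.
        for i in range(n):
            self.register_buffer(f"buf{i}", torch.zeros(1))

    def get_buf(self, i):
        # Fetch a buffer back by its generated name.
        return getattr(self, f"buf{i}")

m = ManyBuffers()
print(len(list(m.buffers())))  # 10
```

Because each is a real registered buffer, all of them get replicated and moved to the right device by DataParallel.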


I think the resize worked! But replacing register_backward_hook() did not fix my original problem. I'll post a new thread for where I am now. Thanks!