Can't save in backward hook on multi GPU - 3rd post, please help

I’m trying to save a value inside a backward hook function with self.X = value1. This works perfectly with one GPU, but with multiple GPUs it does not. It fails even if I store a constant, self.testval = 5. With CUDA_VISIBLE_DEVICES=0 set, checking before the call to backward() says the module has no attribute named testval, but right after backward() it shows up as 5. With CUDA_VISIBLE_DEVICES=0,1 it says no attribute both times. I am completely stuck here.
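To show what I mean, here is a minimal sketch of the pattern (the module and names are just illustrative, not my actual network). On a single device the attribute is there after backward():

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    """Toy module that saves a value on itself from a backward hook."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        self.saved = None
        # hook fires during backward and stores onto self
        self.fc.register_backward_hook(self._hook)

    def _hook(self, module, grad_input, grad_output):
        self.saved = grad_output[0].detach().clone()
        self.testval = 5  # even a plain constant, as described above

    def forward(self, x):
        return self.fc(x)

net = Net()
net(torch.randn(3, 4)).sum().backward()
print(net.testval)  # 5 after backward() on one device
```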

Do you apply the hook before or after wrapping the model in DataParallel?

I call register_backward_hook when I initialize the nn.Module, then I wrap the module in DataParallel, and then run the training with the calls to backward().

When you call register_backward_hook, it traverses the submodules of the module. When you initialize a DataParallel module, it creates a copy of each module for each GPU, so the hook runs on the copies and any attribute it sets lands on a replica, not on your original module. I would try doing the register on the DataParallel module.
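A minimal sketch of that suggestion (this runs on CPU too, where DataParallel just calls the wrapped module directly; the hook and names here are assumptions, not your code):

```python
import torch
import torch.nn as nn

captured = []

def hook(module, grad_input, grad_output):
    # closure state lives outside the module, so replication can't hide it
    captured.append(grad_output[0].shape)

net = nn.DataParallel(nn.Linear(4, 2))
# register on the DataParallel wrapper itself, not the inner module
net.register_backward_hook(hook)

out = net(torch.randn(8, 4))
out.sum().backward()
print(len(captured))
```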


Hmmm, that definitely seems like it could be the problem! Right now I am calling DataParallel on the whole network, but I need the backward hook on individual layers. I don’t seem to be able to call net.module.layerID.register_backward_hook, and I imagine the net.module part is the problem. Do I need to call DataParallel on each layer individually, or is there a better way to tell a DataParallel module that I just want the backward hook on one of its submodules?
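For what it's worth, net.module should give you back the original wrapped network, so something like the following sketch (a toy Sequential, not your actual layer names) works on CPU as written; writing into a dict outside the module also sidesteps the replica-attribute problem. I haven't verified the multi-GPU path:

```python
import torch
import torch.nn as nn

grads = {}

def make_hook(name):
    def hook(module, grad_input, grad_output):
        # store into an external dict instead of setting self.X
        grads[name] = grad_output[0].detach().clone()
    return hook

base = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
net = nn.DataParallel(base)
# the original network is still reachable as net.module
net.module[0].register_backward_hook(make_hook("layer0"))

net(torch.randn(8, 4)).sum().backward()
print(sorted(grads))  # ['layer0']
```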

Update: 90% sure the above solved the problem. Calling DataParallel on each layer individually worked.


In my shallow view, a layer is an nn.Module just like the whole network, so if calling DataParallel on each layer individually solves the problem, you should also be able to call DataParallel on the whole network and then register_backward_hook on specific layers.

Do you mean that even if I do end up calling DataParallel on each layer to solve the problem, I should still additionally call DataParallel on the whole network?