Register forward hook with multiple GPUs

What exactly is the behavior of register forward hook with multiple GPUs?
I want to save the outputs of each layer in my model. For now I have this code:

outputs_layers = []

def save_outputs():
    def hook(module, input, output):
        # store a detached copy of this layer's output
        outputs_layers.append(output.detach())
        print(len(outputs_layers))
        return None

    return hook
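
For completeness, the hook gets attached to every layer roughly like this (MyModel is just a placeholder for my network):

import torch.nn as nn

model = nn.DataParallel(MyModel()).cuda()  # MyModel is a placeholder network

# register the same hook on every submodule so each layer's output is captured
for name, module in model.module.named_modules():
    module.register_forward_hook(save_outputs())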

The problem is that, with multiple GPUs, this does not work; each GPU will receive a fraction of the input, so we need to aggregate the results coming from different GPUs.

This can be done easily, for example by making outputs_layers a dict and concatenating the outputs that share the same key. To make this work, though, we would need to be sure that the order in which the GPUs return their values is always the same, and matches the order of the inputs.

So, in general, how can we use forward hooks with multiple GPUs?


Hi! Any ideas? This would really help me out :slight_smile:

I don’t think you can rely on the hooks running in a deterministic order. However, the list of device ids you passed to DataParallel (or the default [0, 1, 2, 3], assuming 4 GPUs) specifies the order in which your data gets split across devices.

For example, with device ids [0, 1, 2, 3], the first quarter of the batch is sent to device 0, the second quarter to device 1, and so on. You can use this information to reconstruct the order of your outputs: each output captured by a hook will live on a specific GPU.
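
For instance, the hook could record the device of each output, and the chunks can then be put back in batch order by following device_ids (a sketch, assuming a single hooked module and device_ids=[0, 1, 2, 3]):

import torch

captured = []  # (device index, output) pairs, filled in by the hook

def ordering_hook(module, input, output):
    captured.append((output.device.index, output.detach()))

# after model(batch) has run on every replica:
def gather_in_batch_order():
    # DataParallel scatters the batch following device_ids, so sorting by
    # device index restores the original batch order for [0, 1, 2, 3]
    chunks = [out.cpu() for _, out in sorted(captured, key=lambda p: p[0])]
    return torch.cat(chunks, dim=0)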


Ah, right! Then the problem is solved: I just look at the device and reconstruct the right order :slight_smile:

Any example of the solution? I am running into the same issue.

Hi, I tried to register a hook and run it on multiple GPUs. However, it only returns the result on GPU 0.
The data has been successfully split across multiple GPUs.
Does anyone know why?
The code:

def forward(self, x):
        self.activations = []
        self.gradients = []
        self.grad_index = 0
        self.activation_to_layer = {}

        activation_index = 0

        for layer, module in self.model.named_modules():
            if ('conv' in layer) or ('pool' in layer) or ('fc' in layer):
                if 'fc6' in layer:
                    if isinstance(self.model, nn.DataParallel):
                        x = x.view(-1, self.model.module.fc6.in_features)
                    else:
                        x = x.view(-1, self.model.fc6.in_features)
                x = module(x)
            if isinstance(module, torch.nn.modules.conv.Conv3d):
                # register a backward hook on this layer's output tensor;
                # compute_rank will be called with the gradient of x
                x.register_hook(self.compute_rank)
                self.activations.append(x)
                self.activation_to_layer[activation_index] = layer
                activation_index += 1
                x = self.model.relu(x)  # use self.model.module.relu if self.model is a DataParallel
            elif isinstance(module, torch.nn.modules.Linear) and layer != 'fc8':
                x = self.model.dropout(self.model.relu(x))

        return x

def compute_rank(self, grad):
        """
        Compute the Taylor criterion (without abs) for each channel.
        Updates:
            self.activations: feature maps before relu in each layer
            self.filter_ranks: Taylor values (without abs) summed over spatial and batch dims
        """
        activation_index = len(self.activations) - self.grad_index - 1
        activation = self.activations[activation_index]
        if self.pruning_level == 'channel':
            print(self.model)
            print('device', grad.device)
            values = torch.sum((activation * grad), dim = 4).\
                        sum(dim=3).sum(dim=2).sum(dim=0)

            # Normalize the rank by the filter dimensions
            values = values / (activation.size(0) * activation.size(2) \
                                * activation.size(3) * activation.size(4))

            if activation_index not in self.filter_ranks:
                self.filter_ranks[activation_index] = \
                    torch.FloatTensor(activation.size(1)).zero_().cuda()

and the result:

device cuda:0

I met a similar problem. It seems that data parallel with a forward hook cannot guarantee that the activation is on the same device as the model weights. I am not sure why, but perhaps it is because a dictionary, which is not ordered, is used to store the activations. Are there any suggestions on how to solve this? Thanks.

Also having the same issue. Any ideas, anyone?

Solved. Here is the basic idea:
instead of an outputs list, I defined a dictionary of lists.
When I get a hook call, I add the output to the right list according to output.device.
Then, when I return from forward, I return the list according to the input's device.

code:

def __init__(self):
    super().__init__()
    self._outputs_lists = {}  # one list of captured outputs per device
    # self.mymodule is the submodule whose outputs we want to capture (defined elsewhere)
    self.mymodule.register_forward_hook(hook=self.save_output_hook)

def save_output_hook(self, _, input, output):
    # key by the device of the input, so each DataParallel replica
    # appends to its own list and the replicas never clash
    self._outputs_lists[input[0].device].append(output)

def forward(self, x) -> list:
    self._outputs_lists[x.device] = []  # reset the list for this replica's device
    self.mymodule(x)
    return self._outputs_lists[x.device]
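
Usage would look roughly like this (a sketch; MyWrapper stands for the nn.Module class that holds the methods above, and the input shape is just a placeholder):

import torch
import torch.nn as nn

wrapper = MyWrapper()                      # the module containing the code above
model = nn.DataParallel(wrapper).cuda()    # replicate across all visible GPUs

batch = torch.randn(16, 3, 32, 32).cuda()  # placeholder input
outputs = model(batch)                     # gather merges the per-replica lists element-wise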