Moving a tensor to another device failed?

My code looks like this: I wrote a forward hook and try to move QPs to the device of the output tensor before doing some operations, but my assert fails because the two tensors are not on the same device, and I am confused… BTW, I use DataParallel, but I don't think that matters… so what is happening?

def layer1_hook_fn_forward(module, input, output):
    # output shape 2, 64, 10, 56, 56
    global QPs
    QPs = output.new(QPs)
    print('**************')
    print(output.device)
    print(QPs.device)
    print('**************')
    assert QPs.device == output.device, "tensor device not match"
    return output.mul(QPs)

Besides, I tried QPs.to(output.device), and it is still wrong…

You need to do QPs = QPs.to(output.device) for it to work, since .to() does not change the tensor in place.
You can also try QPs = output.new(QPs, device=output.device).
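
For example, a tiny illustration of why the reassignment matters (assuming a CUDA device is available):

import torch

t = torch.zeros(3)       # starts on the CPU
t.to('cuda:0')           # returns a *new* tensor; t itself is unchanged
print(t.device)          # cpu

t = t.to('cuda:0')       # keep the returned tensor to actually move it
print(t.device)          # cuda:0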


All right, I understand it now and fixed it, but then I got an error like the one in this post: RuntimeError: OrderedDict mutated during iteration (while using hook).

There is no answer under that post. I found the corresponding line of code in PyTorch's nn/modules/module.py, so is this a bug, or is my way of using hooks not right?

Not sure. Would you have a small code sample to reproduce this crash please?

OK, I think there is a small bug in nn/modules/module.py. In the __call__ function, I changed the line

        for hook in self._forward_hooks.values():

to

        for hook in list(self._forward_hooks.values()):

and the code will be fine.

The following is the changed __call__() function in module.py:

    def __call__(self, *input, **kwargs):
        for hook in list(self._forward_pre_hooks.values()):
            result = hook(self, input)
            if result is not None:
                if not isinstance(result, tuple):
                    result = (result,)
                input = result
        if torch._C._get_tracing_state():
            result = self._slow_forward(*input, **kwargs)
        else:
            result = self.forward(*input, **kwargs)
        for hook in list(self._forward_hooks.values()):
            hook_result = hook(self, input, result)
            if hook_result is not None:
                result = hook_result
        if len(self._backward_hooks) > 0:
            var = result
            while not isinstance(var, torch.Tensor):
                if isinstance(var, dict):
                    var = next((v for v in var.values() if isinstance(v, torch.Tensor)))
                else:
                    var = var[0]
            grad_fn = var.grad_fn
            if grad_fn is not None:
                for hook in list(self._backward_hooks.values()):
                    wrapper = functools.partial(hook, self)
                    functools.update_wrapper(wrapper, hook)
                    grad_fn.register_hook(wrapper)
        return result

I guess this works because creating the list makes a copy of the hooks data structure and hides the error.
But the underlying error is still there: you are changing the hooks while you run them. Is that what you're doing?
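
For reference, the mutation error itself is easy to reproduce outside of PyTorch, and the list() change only hides it because the loop then runs over a snapshot. A standalone sketch, not taken from your code:

from collections import OrderedDict

hooks = OrderedDict(a=1, b=2)

try:
    # removing an entry while iterating raises the error from your traceback
    for h in hooks.values():
        hooks.pop('b')
except RuntimeError as e:
    print(e)  # OrderedDict mutated during iteration

# iterating over a snapshot: later mutations no longer affect the loop
for h in list(hooks.values()):
    hooks['c'] = 3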

Right… I have a dynamic weight matrix, and on every forward pass I register the forward hook, change the hook output, and then remove the hook, all inside the model's forward function. I wonder: if I register the hook in the model's __init__ function instead, will the variable QPs still get the right value? QPs is determined by the specific input, so will the forward hook still change the output correctly? In that case, where should I register the forward hook and where should I remove it? I am a little confused by the hook mechanism. Thanks!


def layer1_hook_fn_forward(module, input, output):
    # output shape 2, 64, 10, 56, 56
    global QPs  # a module-level variable, set per batch in forward()
    a = QPs.to(output.device)
    assert a.device == output.device, "tensor device not match"
    return output.mul(a)

    def forward(self, inputs):
        # ( (img1,mv1,qp1), (img2,mv2,qp2))
        outputs = []
        for features in inputs:
            mix_features = []
            global QPs
            QPs = self.deal_qp_data(features[2]) #change by input
            # print(next(self.qp_model.parameters()).is_cuda)
            for i in range(len(features)):
                if i == 0:
                    # for rgb
                    features[i] = self.data_bn_channel_3(features[i])
                    handle1 = self.base_model_channel_3.layer1.register_forward_hook(layer1_hook_fn_forward)
                    x = self.base_model_channel_3(features[i])
                    handle1.remove()
                if i == 1:
                    # for mv and residual need batch_normalization
                    features[i] = self.data_bn_channel_2(features[i])
                    handle2 = self.base_model_channel_2.layer1.register_forward_hook(layer1_hook_fn_forward)
                    x = self.base_model_channel_2(features[i])
                    handle2.remove()
                if i == 2:
                    continue
                x = self.dropout(x)
                x = self.key_feature_layer(x)
                # x = (batch, features)
                mix_features.append(x)
            mix_features = torch.cat([mix_features[0], mix_features[1]], dim=1)
            # print(mix_features.shape)
            outputs.append(mix_features)
        x = self.fc_layer_1(torch.abs(outputs[0] - outputs[1]))
        x = F.relu(x)
        x = self.fc_layer_2(x)
        x = torch.sigmoid(x)
        x = self.clf_layer(x)
        return outputs, x

Could you create a small code sample (30-40 lines) that reproduces this problem please? The code you showed above looks ok to me.

I wrote the following code to test it and it runs fine, but the code segment from my last post still fails. So does the error mean that I am changing the hooks while they are running? Besides, my model is a DataParallel model… so where is the problem… I am thinking.

import torch
from torch import nn
import numpy as np
QP = []
class SimpleConv(nn.Module):
    def __init__(self):
        super(SimpleConv, self).__init__()
        self.layer1 = nn.Linear(1,32)
        self.layer2 = nn.Linear(32,64)

    def forward(self, x):
        global QP
        QP = np.repeat(x,32)
        handle = self.layer1.register_forward_hook(layer1_hook_fn_forward)
        x = self.layer1(x)
        handle.remove()
        x = self.layer2(x)
        return x

def layer1_hook_fn_forward(module, input, output):
    # this hook will change the output
    global QP
    a = QP.to(output.device)
    assert a.device == output.device, "tensor device not match"
    return output.mul(a)

for i in range(50):
    model = SimpleConv()
    a= torch.tensor([1.0])
    model(a)

Hi,

I can run this sample without errors locally. Is it not working for you?

Yeah, I can run this sample without errors too, but my real project code still fails… If I change the source code as in my earlier post so that the project can run, is it still OK, from the point of view of a hook that changes the layer output, to do that?

Sorry I’m not sure to understand your question here, could you reformulate it?

I mean: my small example now runs through, but my project still reports an error. As in my earlier post, I can modify the code in module.py so that the project no longer reports the error, but you said the error still exists. What I want to know is whether this still-existing error will affect my purpose for using hooks, that is, modifying a layer's output.

I am not sure exactly what will happen here.
Most likely, since the hook list was copied beforehand, all the original hooks will run regardless of how you modify them during the for loop.

The bigger problem is that it is going to be cumbersome for you to keep changing the module.py file to hack around the problem in your code. In the mid/long term, I think it will be more time-efficient to fix your code.
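
For example, one common pattern is to register the hook once (e.g. in __init__) and pass the per-batch data through an attribute on the module instead of a global, so nothing touches the hook dict during forward. A rough sketch under those assumptions; the class and attribute names below are made up, not taken from your code:

import torch
from torch import nn

class MyBackbone(nn.Module):
    def __init__(self):
        super(MyBackbone, self).__init__()
        self.layer1 = nn.Linear(1, 32)
        self.layer2 = nn.Linear(32, 64)
        self.qps = None  # per-batch weights, set in forward()
        # register once; keep the handle only if you ever need to remove the hook
        self.hook_handle = self.layer1.register_forward_hook(self._layer1_hook)

    def _layer1_hook(self, module, inputs, output):
        # scale the layer output by the per-batch weights
        return output.mul(self.qps.to(output.device))

    def forward(self, x):
        self.qps = x.detach().repeat(32)  # stand-in for something like deal_qp_data()
        x = self.layer1(x)
        x = self.layer2(x)
        return x

model = MyBackbone()
out = model(torch.tensor([1.0]))
print(out.shape)  # torch.Size([64])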

Right, I am trying to find the problem in my code… thanks.

About the hook problem: it seems to be a runtime error; the model can run for 250 iterations and then suddenly errors out… I don't think it is an error in my code.


If you give me a code sample that reproduces the issue, I can take a closer look.