How to loop over all the variables in an nn.Module

and register hooks on their gradients?

Thanks for your reply. Do you know how to get the name or a meaningful identifier of a variable? I did not find a name field on Variable.

Thanks for your reply. But this seems to only work at the layer level?

Currently, I am getting a weird NaN error in the backward pass. I need to inspect the gradient of every variable to see which part goes wrong.
Ideally something similar to the following:

import numpy as np
import torch

def inves(name=''):
    def f(grad):
        # grad is the gradient flowing into this variable during backward
        if np.isnan(torch.mean(grad).data.cpu().numpy()):
            print('gradient of {} is'.format(name))
            print(grad)
            assert 0, 'nan gradient'
    return f

# hypothetical API -- this is what I am looking for: iterate over every
# named variable in the model and register a hook on its gradient
for key, var in Model.variables().items():
    var.register_hook(inves(key))

Update: I found that using locals() can be a solution.
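
A minimal sketch of that locals() idea (assuming the variables of interest are local names in the training code, and inves is the hook factory above):

for name, value in list(locals().items()):
    # only hook tensors that actually take part in autograd
    if isinstance(value, torch.Tensor) and value.requires_grad:
        value.register_hook(inves(name))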

You could try this:

def register_nan_checks(model):
    def check_grad(module, grad_input, grad_output):
        if np.isnan(grad_input.data.numpy()):
            print('NaN gradient in ' + type(module).__name__)
    model.apply(lambda module: module.register_backward_hook(check_grad))

Thanks!
But register_nan_checks(mymodel) seems to have no effect on the model.

Actually I forgot that grad_input is a tuple, so the code should be more like this:

import numpy as np

def register_nan_checks(model):
    def check_grad(module, grad_input, grad_output):
        # uncomment to confirm the hook is called for every module:
        # print(module)
        if any(np.all(np.isnan(gi.data.numpy())) for gi in grad_input if gi is not None):
            print('NaN gradient in ' + type(module).__name__)
    model.apply(lambda module: module.register_backward_hook(check_grad))

I’ve just checked and it works for me. If I add an additional print, it will show all modules.
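
For reference, a hypothetical usage sketch (MyModel, criterion, inputs and targets are assumed names, not from this thread):

model = MyModel()                  # any nn.Module
register_nan_checks(model)
output = model(inputs)             # call the module, don't use .forward directly
loss = criterion(output, targets)
loss.backward()                    # check_grad fires for each module during backward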


Thanks! I have just updated the code, but it still does not work. I added print(module), but nothing is printed.

Can you show me the code? It works for me.


Hi, thanks for your reply.
The code is here: https://github.com/ypxie/pytorch-NeuCom/blob/master/tasks/Copy/train.py
It does not rely on any additional dataset, so it should work after a quick git clone.

This line is the problem. Never call the .forward function directly; call your module like a function: ncomputer(input_data).
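
As a minimal illustration (ncomputer and input_data are from the linked script, the rest is assumed):

output = ncomputer(input_data)            # goes through __call__, so hooks fire
# output = ncomputer.forward(input_data)  # avoid: bypasses registered hooks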


Thanks, it works now!

I have encountered a memory problem with the same code.
In GPU mode it works fine, but when I switch to CPU, the memory consumption gradually blows up and eventually takes all the memory.
What do you think might be the cause?
Thank you!

Do you use Conv2d in your model? CPU convolution uses a lot of memory.

I just use an RNN and a memory network.
It is not just that it uses a lot of memory: when I monitor the memory usage, it grows steadily, and after a short period of time all the memory is consumed and I have to reboot my machine.

Hmm, it might be a memory leak in the CPU code then.
Can you isolate the problem in a minimal example?

Thanks for your comments. Currently, I have no clue which part is going wrong.

Why is it not recommended to use forward directly? Isn't it just building the graph?

Because modules have to do some other things too (e.g. run registered hooks), and they do that in __call__.
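
Roughly speaking (a simplified sketch, not the actual nn.Module source):

def __call__(self, *input, **kwargs):
    for hook in self._forward_pre_hooks.values():
        hook(self, input)
    result = self.forward(*input, **kwargs)
    for hook in self._forward_hooks.values():
        hook(self, input, result)
    # backward hooks are also wired up to the output's grad_fn here,
    # which is why calling .forward() directly makes them silent
    return result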

I don't understand what the issue is. Isn't

    for W in mdl_sgd.parameters():
        W.data = W.data - eta * W.grad.data

what you do to loop over the parameters?
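
For the gradient-hook part of the original question, a similar loop works, as a rough sketch (inves is the hook factory from earlier in the thread):

    for name, W in mdl_sgd.named_parameters():
        W.register_hook(inves(name))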

(Also, it seems your title doesn't reflect your real question. I suggest updating it; it's confusing and unnecessary.)