How to loop over all the variables in an nn.Module

and register hooks on their gradients?

Thanks for your reply. Do you know how to get the name or a meaningful identifier of a variable? I did not find a name field on Variable.

Thanks for your reply. But this seems to only work at the layer level?

Currently, I am getting a weird NaN error in the backward pass. I need to inspect the gradient of every variable to see which part goes wrong.
Ideally something similar to the following:

import numpy as np
import torch

def inves(name=''):
    def f(grad):
        # grad is the gradient flowing into this variable during backward
        if np.isnan(torch.mean(grad).data.cpu().numpy()):
            print('gradient of {} is'.format(name))
            print(grad)
            assert 0, 'nan gradient'
    return f

# hypothetical API -- this is what I am looking for: iterate over every
# named variable in the model and register a hook on its gradient
for key, var in Model.variables().items():
    var.register_hook(inves(key))

Update: I found that using locals() can be a solution.
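
A minimal sketch of that locals() idea (assuming the variables of interest are local names in the training code, and inves is the hook factory above):

for name, value in list(locals().items()):
    # only hook tensors that actually take part in autograd
    if isinstance(value, torch.Tensor) and value.requires_grad:
        value.register_hook(inves(name))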

You could try this:

def register_nan_checks(model):
    def check_grad(module, grad_input, grad_output):
        if np.isnan(grad_input.data.numpy()):
            print('NaN gradient in ' + type(module).__name__)
    model.apply(lambda module: module.register_backward_hook(check_grad))

Thanks!
But register_nan_checks(mymodel) seems to have no effect on the model.

Actually I forgot that grad_input is a tuple, so the code should be more like this:

import numpy as np

def register_nan_checks(model):
    def check_grad(module, grad_input, grad_output):
        # uncomment to confirm the hook is called for every module:
        # print(module)
        if any(np.all(np.isnan(gi.data.numpy())) for gi in grad_input if gi is not None):
            print('NaN gradient in ' + type(module).__name__)
    model.apply(lambda module: module.register_backward_hook(check_grad))

I’ve just checked and it works for me. If I add an additional print, it will show all modules.
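
For reference, a hypothetical usage sketch (MyModel, criterion, inputs and targets are assumed names, not from this thread):

model = MyModel()                  # any nn.Module
register_nan_checks(model)
output = model(inputs)             # call the module, don't use .forward directly
loss = criterion(output, targets)
loss.backward()                    # check_grad fires for each module during backward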


Thanks! I have just updated the code, but it still does not work. I added print(module), but nothing is printed.

Can you show me the code? It works for me.


Hi, thanks for your reply.
The code is here: https://github.com/ypxie/pytorch-NeuCom/blob/master/tasks/Copy/train.py
It does not rely on any additional dataset, so it should work after a quick git clone.

This line is the problem. Never call the .forward function directly; call your module like a function: ncomputer(input_data).
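
As a minimal illustration (ncomputer and input_data are from the linked script, the rest is assumed):

output = ncomputer(input_data)            # goes through __call__, so hooks fire
# output = ncomputer.forward(input_data)  # avoid: bypasses registered hooks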


Thanks, it works now!

I have encountered a memory problem with the same code.
In GPU mode it works fine, but when I switch to CPU, the memory consumption gradually blows up and eventually takes all the memory.
What do you think might be the cause?
Thank you!

Do you use Conv2d in your model? CPU convolution uses a lot of memory.

I just use an RNN and a memory network.
It is not just that it uses a lot of memory: when I monitor the memory usage, it grows steadily, and after a short period of time all the memory is consumed and I have to reboot my machine.

Hmm, it might be a memory leak in the CPU code then.
Can you isolate the problem in a minimal example?

Thanks for your comments. Currently, I have no clue which part is going wrong.

Why is it not recommended to use forward directly? Isn't it just building the graph?

Because modules have to do some other things too (e.g. run registered hooks), and they do that in __call__.
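
Roughly speaking (a simplified sketch, not the actual nn.Module source):

def __call__(self, *input, **kwargs):
    for hook in self._forward_pre_hooks.values():
        hook(self, input)
    result = self.forward(*input, **kwargs)
    for hook in self._forward_hooks.values():
        hook(self, input, result)
    # backward hooks are also wired up to the output's grad_fn here,
    # which is why calling .forward() directly makes them silent
    return result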

I don't understand what the issue is. Isn't

    for W in mdl_sgd.parameters():
        W.data = W.data - eta * W.grad.data

what you do to loop over the parameters?
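
For the gradient-hook part of the original question, a similar loop works, as a rough sketch (inves is the hook factory from earlier in the thread):

    for name, W in mdl_sgd.named_parameters():
        W.register_hook(inves(name))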

(Also, it seems your title doesn't reflect your real question. I suggest updating it; it's confusing and unnecessary.)