Loss.backward throwing CUDA Errors

Sorry for asking this again, but I would like confirmation before I switch to the nightly builds for this issue, since they seem to be a bit unstable.

This is the code:

@staticmethod
def backward(ctx, grad_output):
    grad_label = grad_output.clone()
    num_ft = grad_output.shape[0]
    # grad_label.data.resize_(num_ft, 32, 41)
    lin_indices_3d, lin_indices_2d = ctx.saved_variables
    num_ind = lin_indices_3d.data[0]
    # Copy the gradient values selected at the 3D linear indices into the
    # columns given by the 2D linear indices of the flattened gradient.
    grad_label.data.view(num_ft, -1).index_copy_(
        1, lin_indices_2d.data[1:1 + num_ind],
        torch.index_select(grad_output.data.contiguous().view(num_ft, -1),
                           1, lin_indices_3d.data[1:1 + num_ind]))
    return grad_label, None, None, None

The error I get is:

The reference issue is:

One of the solutions I saw was:

However, even after ensuring the input is contiguous, I got the same error. I am still not sure about the actual reason for the error, though.

Hi,

You can try to enable torch.autograd.set_detect_anomaly(True) to see which forward function corresponds to the one failing in the backward.
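
For instance, a minimal sketch of turning anomaly detection on (the tiny model below is only a placeholder for your own forward/backward):

import torch

# With anomaly detection enabled, a failing backward also prints the
# traceback of the forward call that produced the faulty op.
torch.autograd.set_detect_anomaly(True)

x = torch.randn(4, 3, requires_grad=True)
out = torch.nn.Linear(3, 2)(x).sum()
out.backward()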

You can also try to run the same code on the CPU to see if you get a better error message.
Also, I would double-check all the indices in the multiple indexing you do.
Finally, you should not use .data anymore. This will reduce the chance of weird bugs a lot :slight_smile:

@albanD
detect_anomaly takes me to this code snippet:

It is related to MaxPool1d.

I think the indexing should be fine, since this snippet worked on version 0.4.0 and I am trying to port it here.

Also, if I should not use .data, what would be the alternative that leads to fewer bugs?

Can you share the kernel size and image size that you give to the pooling layer?
Can you also add just after it:

def hook(grad):
    print("Inside hook:")
    print(grad.size())
    print(grad.stride())
imageft.register_hook(hook)

This will tell us the exact properties of the gradient.
Thanks!

For .data, it depends what you use it for.
If you want a new tensor that is detached from the previous one with respect to the autograd, use .detach().
If you want to do in-place ops that are not tracked by the autograd, use the context manager with torch.no_grad().
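
A small sketch of both patterns (the tensor name is just for illustration):

import torch

w = torch.randn(3, requires_grad=True)

# 1) A tensor that shares storage with w but has no autograd history
w_detached = w.detach()

# 2) An in-place op that autograd does not record, e.g. a manual update
with torch.no_grad():
    w.add_(0.1)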

@albanD
Here are the values:

Input shape = torch.Size([128, 442368, 5])

Also, here is the output for the code snippet you asked for:

Inside hook:
torch.Size([128, 442368, 1])
(442368, 1, 1)

Also, regarding .data: as you can see here, I am taking the gradients, reshaping them, and doing a mask select. Sorry, but I am not sure which of the mentioned categories this falls into.

Here are the values:

And what is the kernel size?

Also for the .data as you can see here

If you have no reason to use it, you can just remove it.
It used to be useful when working with Variables. But now that they don’t exist, you can just remove the .data :slight_smile:
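
As a rough sketch, the backward from the first post could drop .data along these lines (assuming the forward saved the index tensors with save_for_backward, so ctx.saved_tensors takes the place of the old ctx.saved_variables):

@staticmethod
def backward(ctx, grad_output):
    grad_label = grad_output.clone()
    num_ft = grad_output.shape[0]
    lin_indices_3d, lin_indices_2d = ctx.saved_tensors
    num_ind = lin_indices_3d[0]
    # Same index juggling as before, just without any .data access
    grad_label.view(num_ft, -1).index_copy_(
        1, lin_indices_2d[1:1 + num_ind],
        torch.index_select(grad_output.contiguous().view(num_ft, -1),
                           1, lin_indices_3d[1:1 + num_ind]))
    return grad_label, None, None, None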

Kernel size is 5. I basically remove the last dimension.
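
(For reference, with MaxPool1d's default stride equal to the kernel size, the pooled length is floor((5 - 5) / 5) + 1 = 1, which matches the torch.Size([128, 442368, 1]) gradient shape printed by the hook.)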

Interesting.
Running the following works fine for me; does it work for you as well?

import torch

a = torch.rand([128, 442368, 5], device="cuda", requires_grad=True)
mod = torch.nn.MaxPool1d(5)
out = mod(a)
out.backward(torch.ones_like(out))

@albanD
Here is what I get when trying the same thing:

Ah! Making progress :slight_smile:

So with the Quadro GP100 and a source install I can’t reproduce that.

Do you see the same behavior on the nightly build for this test script?
Which version of CUDA are you using, and how did you install PyTorch?
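
If it helps, something along these lines prints the relevant version info (assuming a single visible GPU):

import torch

print(torch.__version__)              # installed PyTorch version
print(torch.version.cuda)             # CUDA version PyTorch was built against
print(torch.cuda.get_device_name(0))  # GPU model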

@albanD
I have not switched to the nightly builds yet. I am using PyTorch 1.2, although I tried updating up to 1.5 and the error persists. In all cases, I ended up using pip installations.

As for the OS, it is Ubuntu 18.04; the CUDA version is shown in the attached screenshots.

Thanks for all the info.

Can you try to create a temporary Python virtual env, install the nightly build, and run the simple repro script above?

@albanD I created a new environment.

In the given case, I still get an error.

I attached the results. Do you think I am making some mistake?

@ptrblck do you have an RTX 2070 somewhere to try and repro this by any chance? :smiley:

I just found https://github.com/pytorch/pytorch/issues/38764, which seems to claim that this was fixed.
@ptrblck will know better!

I made a mistake. I was able to run the code snippet without error on the nightly build. I will now check whether the entire code works and will update you as soon as possible.

@albanD it works with the nightly version. Thank you so much for all your support. I am pretty sure I could not have figured it out without your help.

Sorry for the delayed reply, but yes, it should be fixed by now. Please let us know if you run into this issue again.

Just a query: when will a stable version be out? I am not sure about the stability of the nightly builds :frowning: