Loss.backward throwing CUDA Errors

Sorry for asking this again, but I would like confirmation before I switch to the nightly builds for this issue, since they seem to be a bit unstable.

This is the code:

@staticmethod
def backward(ctx, grad_output):
    grad_label = grad_output.clone()
    num_ft = grad_output.shape[0]
    # grad_label.data.resize_(num_ft, 32, 41)
    lin_indices_3d, lin_indices_2d = ctx.saved_variables
    num_ind = lin_indices_3d.data[0]
    # Copy the gradient values selected at the 3D linear indices into the
    # columns given by the 2D linear indices of the flattened gradient.
    grad_label.data.view(num_ft, -1).index_copy_(
        1, lin_indices_2d.data[1:1 + num_ind],
        torch.index_select(grad_output.data.contiguous().view(num_ft, -1),
                           1, lin_indices_3d.data[1:1 + num_ind]))
    return grad_label, None, None, None

The error I get is:

The reference issue is:

One of the solutions I saw was:

However, even after ensuring the input is contiguous, I got the same error. I am still not sure about the actual reason for the error, though.

Hi,

You can try to enable torch.autograd.set_detect_anomaly(True) to see which forward function corresponds to the one failing in the backward.
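
For instance, a minimal sketch of turning anomaly detection on (the tiny model below is only a placeholder for your own forward/backward):

import torch

# With anomaly detection enabled, a failing backward also prints the
# traceback of the forward call that produced the faulty op.
torch.autograd.set_detect_anomaly(True)

x = torch.randn(4, 3, requires_grad=True)
out = torch.nn.Linear(3, 2)(x).sum()
out.backward()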

You can also try to run the same code on the CPU to see if you get a better error message.
Also, I would double-check all the indices in the multiple indexing you do.
Finally, you should not use .data anymore. This will reduce the chance of weird bugs a lot :slight_smile:

@albanD
detect_anomaly takes me to this code snippet:

It is related to MaxPool1d.

I think the indexing should be fine, since this snippet worked on version 0.4.0 and I am trying to port it here.

Also, if I should not use .data, what would be the alternative that leads to fewer bugs?

Can you share the kernel size and image size that you give to the pooling layer?
Can you also add just after it:

def hook(grad):
    print("Inside hook:")
    print(grad.size())
    print(grad.stride())
imageft.register_hook(hook)

This will tell us the exact properties of the gradient.
Thanks!

For .data, it depends what you use it for.
If you want a new tensor that is detached from the previous one with respect to the autograd, use .detach().
If you want to do in-place ops that are not tracked by the autograd, use the context manager with torch.no_grad().
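
A small sketch of both patterns (the tensor name is just for illustration):

import torch

w = torch.randn(3, requires_grad=True)

# 1) A tensor that shares storage with w but has no autograd history
w_detached = w.detach()

# 2) An in-place op that autograd does not record, e.g. a manual update
with torch.no_grad():
    w.add_(0.1)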

@albanD
Here are the values:

Input shape = torch.Size([128, 442368, 5])

Also, here is the output for the code snippet you asked for:

Inside hook:
torch.Size([128, 442368, 1])
(442368, 1, 1)

Also, regarding .data: as you can see here, I am taking the gradients, reshaping them, and doing a mask select. Sorry, but I am not sure which of the mentioned categories this falls into.

Here are the values:

And what is the kernel size?

Also for the .data as you can see here

If you have no reason to use it, you can just remove it.
It used to be useful when working with Variables. But now that they don’t exist, you can just remove the .data :slight_smile:
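
As a rough sketch, the backward from the first post could drop .data along these lines (assuming the forward saved the index tensors with save_for_backward, so ctx.saved_tensors takes the place of the old ctx.saved_variables):

@staticmethod
def backward(ctx, grad_output):
    grad_label = grad_output.clone()
    num_ft = grad_output.shape[0]
    lin_indices_3d, lin_indices_2d = ctx.saved_tensors
    num_ind = lin_indices_3d[0]
    # Same index juggling as before, just without any .data access
    grad_label.view(num_ft, -1).index_copy_(
        1, lin_indices_2d[1:1 + num_ind],
        torch.index_select(grad_output.contiguous().view(num_ft, -1),
                           1, lin_indices_3d[1:1 + num_ind]))
    return grad_label, None, None, None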

Kernel size is 5. I basically remove the last dimension.
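
(For reference, with MaxPool1d's default stride equal to the kernel size, the pooled length is floor((5 - 5) / 5) + 1 = 1, which matches the torch.Size([128, 442368, 1]) gradient shape printed by the hook.)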

Interesting.
Running the following works fine for me; does it work for you as well?

import torch

a = torch.rand([128, 442368, 5], device="cuda", requires_grad=True)
mod = torch.nn.MaxPool1d(5)
out = mod(a)
out.backward(torch.ones_like(out))

@albanD
Here is what I get when trying the same thing:

Ah! Making progress :slight_smile:

So with the Quadro GP100 and a source install I can’t reproduce that.

Do you see the same behavior on the nightly build for this test script?
Which version of CUDA are you using, and how did you install PyTorch?
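
If it helps, something along these lines prints the relevant version info (assuming a single visible GPU):

import torch

print(torch.__version__)              # installed PyTorch version
print(torch.version.cuda)             # CUDA version PyTorch was built against
print(torch.cuda.get_device_name(0))  # GPU model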

@albanD
I have not switched to the nightly builds yet. I am using PyTorch 1.2, although I tried updating up to 1.5 and the error persists. In all cases, I ended up using pip installations.

As for the OS, it is Ubuntu 18.04; the CUDA version is shown in the attached screenshots.

Thanks for all the info.

Can you try to create a temporary Python virtual env, install the nightly build, and run the simple repro script above?

@albanD I created a new environment.

In the given case, I still get an error.

I attached the results. Do you think I am making some mistake?

@ptrblck do you have an RTX 2070 somewhere to try and repro this by any chance? :smiley:

I just found https://github.com/pytorch/pytorch/issues/38764, which seems to claim that this was fixed.
@ptrblck will know better!

I made a mistake. I was able to run the code snippet without error on the nightly build. I will now check whether the entire code works and will update you as soon as possible.

@albanD it works with the nightly version. Thank you so much for all your support. I am pretty sure I could not have figured it out without your help.

Sorry for the delayed reply, but yes, it should be fixed by now. Please let us know if you run into this issue again.

Just a query: when will a stable version be out? I am not sure about the stability of the nightly builds :frowning: