Segmentation fault with register_hook

I am trying to debug some confusing problems with my modified sequence-to-sequence network. I have figured out that the gradients are becoming NaN at some point, so I am trying to store the gradients so I can look at the values at all the intermediate steps using something like pdb.

If I do not add a backward hook, my program crashes with the error I have been expecting (though it isn’t very informative; it just tells me something about a “device-side assert” from CUDA). If I add any backward hook at all, I get a segmentation fault instead, before my error checking is ever reached.

For example, if I add this to my decoder class:

    def save_grad(self, name):
        # returns a hook that will be called with the gradient of the
        # tensor it is registered on; name identifies that tensor
        def hook(grad):
            pass
        return hook

and add these lines to the very end of my decoder’s forward pass:

        # intermediate tensors from the forward pass, plus names for them
        tensors = [rnn_output, hidden, attn_weights, context, concat_output, output, mem_vec]
        names = ["rnn_output", "hidden", "attn_weights", "context", "concat_output", "output", "mem_vec"]

        for var, name in zip(tensors, names):
            var.register_hook(self.save_grad(name))

(where rnn_output, hidden, etc. are various intermediate variables in the computation), then my program quickly terminates with the line “Segmentation fault”. It produces no other traceback whatsoever.
I also tried this with pass replaced by a print statement, with the values stored in a dictionary, and so on, and all of them had the same problem. The problem also appeared when save_grad was a global function defined outside my model class.
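
For reference, the dictionary-storing variant looked roughly like this (a sketch; self.grads is just a dict I initialize in __init__):

    def save_grad(self, name):
        def hook(grad):
            # stash the gradient under the tensor's name so it can be
            # inspected later, e.g. from pdb after backward()
            self.grads[name] = grad
        return hook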

Hi,

First, to get a proper error message for the device-side assert, you can try two things:

  • Run on CPU to get the full CPU stack trace.
  • Run your code with CUDA_LAUNCH_BLOCKING=1 to get the actual assert message printed (see the sketch below).
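
For example, a minimal sketch of setting it from Python (it has to happen before CUDA is initialized, so put it at the very top of your entry script):

    import os
    # Force synchronous kernel launches so the assert is reported at the
    # call site instead of at some later, unrelated operation.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch  # import after setting the variable to be safe

Setting the variable in the shell (CUDA_LAUNCH_BLOCKING=1 python your_script.py) works just as well.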

For the hook issue, could you provide a small code sample that reproduces the segfault, please?

Unfortunately, it is not computationally feasible to run this on CPU. If I run the model with parameters that are too small, the error does not appear, and even in the most scaled-down case I have found that still reproduces it, running on CPU would be highly inconvenient at best.
I have tried CUDA_LAUNCH_BLOCKING=1, and when I did, I got an error message several hundred lines long that was very difficult to parse. It looks something like this:

    /pytorch/torch/lib/THC/THCTensorIndex.cu:325: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [50,0,0], thread: [30,0,0] Assertion srcIndex < srcSelectDimSize failed.
    /pytorch/torch/lib/THC/THCTensorIndex.cu:325: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [50,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.
    Segmentation fault

except with that first part repeated several hundred times.

I have considered that the segmentation fault I see there and the one I get when I add the hook are caused by the same thing, but I don’t know enough about what can cause a segfault in PyTorch to say for sure. From what I can tell, the CUDA error happens somewhere during the forward pass when there are NaNs in my weights, and somewhere in there CUDA blows up and segfaults. I am trying to track down where the gradient becomes NaN in the backward pass before this happens; that is why I am attempting to use the backward hooks. I am not sure why (or whether) the hook might trigger the same segfault, though when it comes from the forward pass I get a massively long error message, and when it comes from the hook it literally just says “Segmentation fault” and nothing else.
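
Here is roughly what the NaN-detecting version of my hook looks like (a minimal sketch; grad.data != grad.data is just a version-agnostic NaN test, and check_nan is my own name for it):

    def check_nan(self, name):
        def hook(grad):
            # x != x is True exactly at NaN entries; .data keeps this
            # working on both 0.3-style Variables and 0.4 tensors
            if (grad.data != grad.data).any():
                print("NaN gradient flowing into %s" % name)
        return hook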

Unfortunately, providing a small code sample will also be difficult, as the code base for this model is running into the thousands of lines now. I could try giving all the code for the forward pass? That would still be very long, though.

The CUDA error here means that you are indexing a tensor with an index that is too big for the dimension you selected.
If there are a few hundred of these messages, that means there are many invalid indices in this indexing (possibly NaN values).
One thing you can do is add checks in your code before every indexing operation to verify that the index values are not too big, like assert indices.max() < to_index.size(index_dim). That will help you see where the error happens first.
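
For instance, something like this before each indexing op (a sketch; indices, to_index, and index_dim stand in for whatever your code uses):

    # sanity-check index bounds before e.g.
    # to_index.index_select(index_dim, indices) or an embedding lookup
    assert indices.min() >= 0, "negative index"
    assert indices.max() < to_index.size(index_dim), "index out of range"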

For the segfault, there were some issues with the 0.3.x binaries where printing the stack trace and error message could itself cause a segfault (that is why you don’t see a proper Python stack trace).
If you are using these binaries, you can upgrade to the latest version to remove the segfault and get a proper stack trace.

When I googled the error I did see that it was related to indexing somehow, but I doubt that is the actual problem. The code runs for most of an epoch before it crashes, and when I add checks to see what is happening with the gradients, they show that the gradients are becoming NaN before this error is thrown. I suspect the CUDA error is just a symptom of the gradients/weights becoming NaN, and whatever is causing that is the actual problem.
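
The checks I am running look roughly like this after each optimizer step (a sketch; model stands in for my encoder/decoder modules):

    def report_nans(model):
        # x != x is True exactly at NaN entries
        for name, param in model.named_parameters():
            if (param.data != param.data).any():
                print("NaN in weights of %s" % name)
            if param.grad is not None and \
                    (param.grad.data != param.grad.data).any():
                print("NaN in gradient of %s" % name)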

Ah, that is interesting. I will definitely try that, then. Does 0.4.x fix that issue?

Yes, this has been fixed in the new binaries.