.cuda() randomly produces a runtime error

In my code, there are many places where a variable is transferred to the GPU with a .cuda() call, like

x = x.cuda()

Once training starts, the program always crashes at some point, but at a different one of these calls each time.

One example is like this:

h = h.cuda()
return CudaTransfer(device_id, async)(self)
return i.cuda(async=self.async)
return new_type(self.size()).copy_(self, async)

RuntimeError: cuda runtime error (59) : device-side assert triggered at /data/users/soumith/miniconda2/conda-bld/pytorch-0.1.10_1488755368782/work/torch/lib/THC/generic/THCTensorCopy.c:18

I really cannot understand what is going on here.

I also tried catching the exception and inspecting the variable just before the .cuda() call. The variable looks normal.

Can anyone help?

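A device-side assert is usually triggered by out-of-bounds indexing. Because CUDA kernels are launched asynchronously, the error is only reported at a later call that synchronizes with the GPU, such as .cuda(), which is why the crash seems to move around.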
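
If it helps, here is a minimal sketch of the kind of check that can catch this on the CPU before the data reaches the GPU. The helper name find_out_of_range and the example values are hypothetical, not part of PyTorch. It assumes the indices are available as plain Python integers (for example, the values of a CPU index tensor converted to a list) and treats negative values as invalid, which fits embedding-style lookups.

```python
def find_out_of_range(indices, size):
    """Return (position, value) pairs for every index outside [0, size)."""
    return [(pos, idx) for pos, idx in enumerate(indices) if not 0 <= idx < size]


# Hypothetical example: a lookup table with 5 rows and a batch with one bad index.
table_size = 5
batch = [0, 3, 5, 2]

bad = find_out_of_range(batch, table_size)
if bad:
    print("out-of-range indices (position, value):", bad)
```

Running a check like this on the inputs of every indexing or embedding operation should narrow down which batch carries the bad index.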

To get the exact location of the crash, you can run your program after setting this environment variable:

export CUDA_LAUNCH_BLOCKING=1
python myprogram.py
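
If setting the variable in the shell is inconvenient, the same thing can be done from inside the script. This is only a sketch: it assumes the assignment runs before anything initializes CUDA, so it belongs at the very top of the entry script, before torch is imported.

```python
import os

# Must be set before the first CUDA call, so do it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import the framework only after the variable is set

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

With blocking launches, each kernel finishes before the next Python line runs, so the traceback should point at the operation that actually failed. Training will run slower, so remove the setting once the bug is found.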

Thanks! It was an out-of-bounds indexing error.