Issue with CUDA: nan/inf losses and device-side assert on GPU

I am having some strange issues when running my model on the GPU. The model takes multiple inputs, computes features for each, concatenates them, and makes predictions.
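For reference, this is roughly the shape of the model (a minimal hypothetical sketch; the layer sizes and names are made up, not my actual code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiInputModel(nn.Module):
    # Hypothetical stand-in: one feature extractor per input, with the
    # features concatenated before the prediction head.
    def __init__(self, in_dims, hidden_dim, out_dim):
        super(MultiInputModel, self).__init__()
        self.extractors = nn.ModuleList([nn.Linear(d, hidden_dim) for d in in_dims])
        self.head = nn.Linear(hidden_dim * len(in_dims), out_dim)

    def forward(self, inputs):
        feats = [F.relu(extractor(x) ) for extractor, x in zip(self.extractors, inputs)]
        return self.head(torch.cat(feats, 1))  # the concatenation step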

When I run
CUDA_VISIBLE_DEVICES=0 python -m supervised.model.train --lr 1e-3 --batch_size 32 --cuda
the losses blow up to nan or inf after a few batches:

Train Epoch: 1 [27/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.060, 0.006 | Losses (r, q, total) 27.5339, 21.0737, 48.6076
Train Epoch: 1 [28/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.058, 0.006 | Losses (r, q, total) 27.3787, nan, nan
Train Epoch: 1 [29/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.060, 0.006 | Losses (r, q, total) nan, nan, nan

or, on another run:

Train Epoch: 1 [11/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.061, 0.005 | Losses (r, q, total) 29.9488, 23.4458, 53.3945
Train Epoch: 1 [12/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.061, 0.005 | Losses (r, q, total) inf, 23.1003, inf
Train Epoch: 1 [13/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.060, 0.005 | Losses (r, q, total) inf, 23.1479, inf
Train Epoch: 1 [14/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.061, 0.005 | Losses (r, q, total) inf, 22.9168, inf
Train Epoch: 1 [15/6204 (0%)] | lr 1.00e-03 | s/batch 0.011, 0.078, 0.005 | Losses (r, q, total) inf, 22.8032, inf
Train Epoch: 1 [16/6204 (0%)] | lr 1.00e-03 | s/batch 0.009, 0.087, 0.005 | Losses (r, q, total) inf, 22.7152, inf
Train Epoch: 1 [17/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.063, 0.005 | Losses (r, q, total) inf, 22.4327, inf
Train Epoch: 1 [18/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.061, 0.005 | Losses (r, q, total) inf, 22.4060, inf
Train Epoch: 1 [19/6204 (0%)] | lr 1.00e-03 | s/batch 0.009, 0.069, 0.004 | Losses (r, q, total) inf, inf, inf

If I use a larger batch size (128), I instead get a device-side assert:

Train Epoch: 1 [16/1551 (1%)] | lr 1.00e-03 | s/batch 0.034, 0.185, 0.011 | Losses (r, q, total) 29.1557, 22.7390, 51.8947
Train Epoch: 1 [17/1551 (1%)] | lr 1.00e-03 | s/batch 0.034, 0.185, 0.011 | Losses (r, q, total) 29.0046, 22.5440, 51.5485
/b/wheel/pytorch-src/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [26,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
File "/usr/local/lib/python3.5/dist-packages/torch/autograd/_functions/tensor.py", line 310, in forward
    return torch.cat(inputs, self.dim)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMath.cu:226

The code runs perfectly on the CPU. Moreover, if I set the environment variable CUDA_LAUNCH_BLOCKING=1 before the script, everything works just fine: nothing crashes and nothing goes to inf or nan.
What could the issue be, and how do I fix it? The assert seems to be raised from inside a PyTorch CUDA kernel.
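For reference, the run that works is the same command as above, just prefixed with the environment variable:

CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0 python -m supervised.model.train --lr 1e-3 --batch_size 32 --cuda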

Versions:

Python 3.5.2
Cuda compilation tools, release 8.0, V8.0.44
Ubuntu 16.04.2 LTS
PyTorch 0.1.12.post2, installed via: pip install http://download.pytorch.org/whl/cu80/torch-0.1.12.post2-cp35-cp35m-linux_x86_64.whl

I think the fact that everything works once you add CUDA_LAUNCH_BLOCKING=1 is probably a false positive: in that case the out-of-bounds memory that gets read just happens to be filled with zeros. The same goes for the CPU (the illegal memory being hit is probably all zeros).

This is most definitely an index-out-of-bounds issue (indexing with an index < 0 or >= size).
Can you run this repeatedly with CUDA_LAUNCH_BLOCKING=1 and see if you can get a good stack trace to identify the location of your issue?
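One way to narrow it down (a minimal sketch, not your actual code; the helper and the tensor/embedding names are assumptions) is to validate index tensors on the CPU right before they reach an embedding lookup or index_select on the GPU, since an out-of-bounds index only surfaces there as a device-side assert:

def check_indices(indices, size, name="indices"):
    # Hypothetical helper: fail fast with a readable message instead of a
    # device-side assert if any index is < 0 or >= size.
    lo, hi = int(indices.min()), int(indices.max())
    assert lo >= 0 and hi < size, \
        "%s out of bounds: min=%d, max=%d, valid range is [0, %d)" % (name, lo, hi, size)

# e.g. right before the forward pass, with a hypothetical embedding layer:
# check_indices(token_ids, embedding.num_embeddings, "token_ids")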

Turns out the issue was with h5py and DataLoader. Using 2 or more workers with DataLoader while reading from an HDF5 input results in corrupted data. The data read back was randomly corrupted, and that is what caused the errors in the network: losses going to nan, or illegal indexing.
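For anyone hitting the same thing, a common workaround (a sketch under assumptions; the dataset class and the "inputs"/"targets" dataset names are hypothetical, not my actual code) is to stop sharing a single h5py file handle across worker processes and instead open the file lazily inside each worker:

import h5py
import torch
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    # Hypothetical dataset: the HDF5 file is opened lazily in each worker
    # process rather than once in the parent, so the handle is never shared.
    def __init__(self, path):
        self.path = path
        self.file = None
        with h5py.File(path, "r") as f:
            self.length = len(f["inputs"])

    def __getitem__(self, idx):
        if self.file is None:  # first access in this worker process
            self.file = h5py.File(self.path, "r")
        x = torch.from_numpy(self.file["inputs"][idx])
        y = torch.from_numpy(self.file["targets"][idx])
        return x, y

    def __len__(self):
        return self.length

Alternatively, keeping the original dataset and setting num_workers=0 on the DataLoader avoids the problem entirely, at the cost of loading data in the main process.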
