I am having some strange issues running my model on the GPU. The model takes multiple inputs, computes features for each, concatenates them, and makes predictions.
When I run
CUDA_VISIBLE_DEVICES=0 python -m supervised.model.train --lr 1e-3 --batch_size 32 --cuda
the losses blow up to nan within a few batches:
Train Epoch: 1 [27/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.060, 0.006 | Losses (r, q, total) 27.5339, 21.0737, 48.6076
Train Epoch: 1 [28/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.058, 0.006 | Losses (r, q, total) 27.3787, nan, nan
Train Epoch: 1 [29/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.060, 0.006 | Losses (r, q, total) nan, nan, nan
or
Train Epoch: 1 [11/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.061, 0.005 | Losses (r, q, total) 29.9488, 23.4458, 53.3945
Train Epoch: 1 [12/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.061, 0.005 | Losses (r, q, total) inf, 23.1003, inf
Train Epoch: 1 [13/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.060, 0.005 | Losses (r, q, total) inf, 23.1479, inf
Train Epoch: 1 [14/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.061, 0.005 | Losses (r, q, total) inf, 22.9168, inf
Train Epoch: 1 [15/6204 (0%)] | lr 1.00e-03 | s/batch 0.011, 0.078, 0.005 | Losses (r, q, total) inf, 22.8032, inf
Train Epoch: 1 [16/6204 (0%)] | lr 1.00e-03 | s/batch 0.009, 0.087, 0.005 | Losses (r, q, total) inf, 22.7152, inf
Train Epoch: 1 [17/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.063, 0.005 | Losses (r, q, total) inf, 22.4327, inf
Train Epoch: 1 [18/6204 (0%)] | lr 1.00e-03 | s/batch 0.010, 0.061, 0.005 | Losses (r, q, total) inf, 22.4060, inf
Train Epoch: 1 [19/6204 (0%)] | lr 1.00e-03 | s/batch 0.009, 0.069, 0.004 | Losses (r, q, total) inf, inf, inf
If I use a larger batch size (128), I instead get a device-side assert:
Train Epoch: 1 [16/1551 (1%)] | lr 1.00e-03 | s/batch 0.034, 0.185, 0.011 | Losses (r, q, total) 29.1557, 22.7390, 51.8947
Train Epoch: 1 [17/1551 (1%)] | lr 1.00e-03 | s/batch 0.034, 0.185, 0.011 | Losses (r, q, total) 29.0046, 22.5440, 51.5485
/b/wheel/pytorch-src/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [26,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
File "/usr/local/lib/python3.5/dist-packages/torch/autograd/_functions/tensor.py", line 310, in forward
return torch.cat(inputs, self.dim)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /b/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMath.cu:226
The code runs perfectly if I run it on the CPU. Moreover, if I prepend the environment variable CUDA_LAUNCH_BLOCKING=1
to the command, everything works just fine: nothing crashes, and nothing goes to inf or nan.
What could the issue be, and how do I fix it? The assertion seems to come from inside a PyTorch CUDA kernel.
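From what I've read, the `srcIndex < srcSelectDimSize` assertion usually means an out-of-range index reached an index_select or embedding lookup on the GPU. To rule that out, I'm planning to sanity-check my index tensors on the CPU with something like the sketch below (`check_indices` and the toy tensor are placeholders I wrote for illustration, not my actual code):

```python
import torch

def check_indices(indices, num_embeddings):
    # Return the positions whose index would violate the GPU-side
    # `srcIndex < srcSelectDimSize` assertion, i.e. indices that are
    # negative or >= the size of the lookup table.
    bad = (indices < 0) | (indices >= num_embeddings)
    return bad.nonzero()

# Toy example: lookup table of size 10; the 12 at position 2 is out of range.
idx = torch.LongTensor([3, 7, 12, 0])
print(check_indices(idx, num_embeddings=10))
```

If a check like this flags any positions, that would explain why only the GPU run crashes, since the CPU kernels don't run the same assert.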
Versions:
Python 3.5.2
Cuda compilation tools, release 8.0, V8.0.44
Ubuntu 16.04.2 LTS
pip install http://download.pytorch.org/whl/cu80/torch-0.1.12.post2-cp35-cp35m-linux_x86_64.whl