RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
The code works fine on CPU, but with multi-GPU training I get the error below.
I searched everywhere but couldn't find much on this error. It seems that I am passing out-of-bounds indices, yet the code works perfectly fine on CPU and on a single GPU as well.
The issue only happens with multi-GPU training.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [106,0,0], thread: [18,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
0%| | 0/320000 [00:13<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 176, in <module>
start(DATASET_PATH, VOCAB_PATH, PRETRAINED_MODEL_PATH)
File "train.py", line 124, in start
_forward_step(convHan, batch_t, batch_t_senders, lst_labels, posts_words_order, len_conv, len_posts, optimizer, criterion, compute_device)
File "train.py", line 58, in _forward_step
out = _net(var_data[0], var_data[1], var_data[3], var_data[4], var_data[5], training_mode=True) # batch_t, posts_words_order, len_conv, len_post
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/workspace/code/complaint_livechat/voice_model/nets_multi_sender.py", line 107, in forward
sent_embs = self.reply(packed_sents, emb_sender)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/workspace/code/complaint_livechat/voice_model/nets_multi_sender.py", line 63, in forward
rnn_sents, _ = self.rnn(packed_batch)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/rnn.py", line 562, in forward
self.num_layers, self.dropout, self.training, self.bidirectional)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
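
One way to narrow this down: the assertion in the log comes from `IndexKernel.cu`, and because CUDA kernels run asynchronously, the cuDNN error reported at `self.rnn(...)` may only be a downstream symptom of an earlier indexing operation. Running with the environment variable `CUDA_LAUNCH_BLOCKING=1` should make the traceback point at the operation that actually failed. The sketch below is a small check that mirrors the condition in the assertion (`index >= -sizes[i] && index < sizes[i]`). The helper name `find_out_of_range` is hypothetical, not part of PyTorch, and it works on plain Python integers so it can be called from anywhere.

```python
def find_out_of_range(indices, size):
    """Return (position, value) pairs that violate -size <= index < size.

    `indices` is any iterable of Python ints; `size` is the length of the
    dimension being indexed. This mirrors the check in the CUDA assertion.
    """
    return [(pos, idx) for pos, idx in enumerate(indices)
            if not (-size <= idx < size)]


# Example with plain integers: 4 and -5 are invalid for a dimension of size 4.
bad = find_out_of_range([0, 3, -4, 4, -5], size=4)
print(bad)  # [(3, 4), (4, -5)]

# Assumed usage inside a module's forward(), where `idx` is an index tensor
# and `x` is the tensor being indexed along dimension 0 (hypothetical names):
#     bad = find_out_of_range(idx.view(-1).tolist(), x.size(0))
#     assert not bad, f"out-of-range indices on {x.device}: {bad[:10]}"
```

Running such a check inside each module's `forward` is useful here because `DataParallel` splits tensor inputs along dimension 0 before calling each replica. Indices that are valid against the full batch (for example, reordering indices like `posts_words_order`) may no longer be valid against the smaller per-replica chunk, which would be consistent with the code working on CPU and on a single GPU. This is only a possible cause, not a confirmed diagnosis.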