Multi-GPU training gives out-of-bounds indices, but CPU and single GPU work fine

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

The code works fine on CPU, but with multi-GPU training I get the error below.
I have searched everywhere and can't find much on this error. It seems I am sending out-of-bounds indices, yet the same code works perfectly fine on CPU and with a single GPU.

The issue only happens with multi-GPU training.
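
For context, the model is wrapped for multi-GPU roughly like this (a simplified sketch; the constructor name and arguments are placeholders, only the nn.DataParallel wrapping matches my actual code):

```python
import torch
import torch.nn as nn

# Simplified sketch of the multi-GPU setup.
# nn.DataParallel splits every input tensor along dim 0 and runs one
# replica of the model per visible GPU.
convHan = ConvHanNet(vocab_size, hidden_size)   # placeholder constructor
if torch.cuda.device_count() > 1:
    convHan = nn.DataParallel(convHan)
convHan = convHan.to("cuda")
```

The full traceback: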

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [106,0,0], thread: [18,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
  0%|                                                                                                                          | 0/320000 [00:13<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 176, in <module>
    start(DATASET_PATH, VOCAB_PATH, PRETRAINED_MODEL_PATH)
  File "train.py", line 124, in start
    _forward_step(convHan, batch_t, batch_t_senders, lst_labels, posts_words_order, len_conv, len_posts, optimizer, criterion, compute_device)
  File "train.py", line 58, in _forward_step
    out = _net(var_data[0], var_data[1], var_data[3], var_data[4], var_data[5], training_mode=True)  # batch_t, posts_words_order, len_conv, len_post
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/workspace/code/complaint_livechat/voice_model/nets_multi_sender.py", line 107, in forward
    sent_embs = self.reply(packed_sents, emb_sender)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/workspace/code/complaint_livechat/voice_model/nets_multi_sender.py", line 63, in forward
    rnn_sents, _ = self.rnn(packed_batch)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/rnn.py", line 562, in forward
    self.num_layers, self.dropout, self.training, self.bidirectional)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Could you rerun the code with CUDA_LAUNCH_BLOCKING=1 python script.py args and post the stack trace here?
The cuDNN error seems to be a red herring; the indexing kernel assert appears to be the actual source of the failure.
Also, if you are on an older PyTorch version, could you update to the latest stable release?
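
As a quick additional sanity check, you could assert that every index tensor is in range right before the embedding/index op that feeds the failing RNN call. A minimal sketch (the helper and the names `idx`/`emb` are placeholders you would adapt to your inputs):

```python
import torch

def check_indices(idx: torch.Tensor, emb: torch.nn.Embedding, name: str = "idx") -> None:
    # Fail early with a readable message instead of a device-side assert.
    if idx.numel() == 0:
        return
    lo, hi = int(idx.min()), int(idx.max())
    if lo < 0 or hi >= emb.num_embeddings:
        raise ValueError(
            f"{name}: indices span [{lo}, {hi}], but the embedding only has "
            f"{emb.num_embeddings} rows"
        )

# e.g. call it in forward() right before the lookup:
# check_indices(batch_t, self.word_embedding, "batch_t")
```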