RuntimeError: CUDNN_STATUS_EXECUTION_FAILED

Hi,
I was recently trying to implement my own version of the contrastive loss function, which started failing after the latest CUDA update.
This is on a CentOS 6 cluster running the latest PyTorch compiled against CUDA 8.0/cuDNN 8.0. The same code worked two weeks ago, before the NVIDIA driver update and on the older PyTorch build.
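
For context, the loss is a standard pairwise contrastive formulation. A simplified sketch of the idea is below (not my exact code; the class and variable names here are just illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    """Pairwise contrastive loss: pulls similar pairs together,
    pushes dissimilar pairs apart up to a margin."""
    def __init__(self, margin=1.0):
        super(ContrastiveLoss, self).__init__()
        self.margin = margin

    def forward(self, out1, out2, label):
        # label == 1 for similar pairs, 0 for dissimilar pairs
        dist = F.pairwise_distance(out1, out2)
        pos = label * dist.pow(2)
        neg = (1 - label) * torch.clamp(self.margin - dist, min=0).pow(2)
        return 0.5 * (pos + neg).mean()
```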

Specifically, if I run the code with CUDA_LAUNCH_BLOCKING=1, I get the following stack trace:

Traceback (most recent call last):
  File "src/main.py", line 147, in <module>
    train(batch_logger=train_batch_logger)
  File "/home/ifs-users/bjuncek/thesis_working/src/train.py", line 40, in train_epoch
    loss.backward()
  File "/home/ifs-users/bjuncek/bin/miniconda3/envs/pytorch03/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/ifs-users/bjuncek/bin/miniconda3/envs/pytorch03/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED
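
For completeness, this is roughly how the flag can be applied; I set it via the environment, and the inline version below is just an illustrative equivalent, not my exact invocation:

```python
# Set the env var before CUDA is initialized so kernel launches are
# synchronous and the failing op shows up in the Python traceback.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the env var is set
```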

If, however, I add torch.backends.cudnn.benchmark = True to my script, I end up with the stack trace below.

Traceback (most recent call last):
  File "src/main.py", line 148, in <module>
    train(batch_logger=train_batch_logger)
  File "/home/ifs-users/bjuncek/thesis_working/src/train.py", line 40, in train_epoch
    loss.backward()
  File "/home/ifs-users/bjuncek/bin/miniconda3/envs/pytorch03/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/ifs-users/bjuncek/bin/miniconda3/envs/pytorch03/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/THCBlas.cu:247
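
In case placement matters, the flag is set along these lines (a minimal sketch; my actual script differs):

```python
import torch
import torch.backends.cudnn as cudnn

# Set once at module scope, before the model is built and training starts,
# so cuDNN autotunes convolution algorithms for the fixed input sizes.
cudnn.benchmark = True
```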