Error in CUDNN status when calling batch normalization, python=3.6, pytorch=0.4.0, cuda 9

teresa · February 25, 2022, 11:01am

In a project with python=3.6 and pytorch=0.4.0, I am getting the following error when running the model:

  File "net.py", line 150, in forward
    x = self.bn1(x)
  File "/home/…/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/…/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 49, in forward
    self.training or not self.track_running_stats, self.momentum, self.eps)
  File "/home/…/lib/python3.6/site-packages/torch/nn/functional.py", line 1194, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED

Do you know what might be causing this? Thank you.

teresa · February 25, 2022, 2:35pm

After some research, the problem turns out to be that the RTX 3090 is not compatible with CUDA 9, and neither is the RTX 2080. Those are the only GPUs I have available for training. Removing the CUDA calls from the training code avoids the error, and the model is now training on the CPU.
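The mismatch described above can be sketched as a small lookup. This is only an illustration: the table covers just the two GPUs mentioned in this thread, the CUDA versions in it follow what is stated in the thread (CUDA 10 for the RTX 2080, CUDA 11.x for the RTX 3090), and the names `MIN_CUDA_FOR_CAPABILITY`, `parse_cuda_version`, and `is_supported` are made up for this example.

```python
# Earliest CUDA release line that works with each GPU's compute capability.
# Assumption: only the GPUs from this thread are listed.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 5): (10, 0),  # Turing, e.g. RTX 2080
    (8, 6): (11, 0),  # Ampere, e.g. RTX 3090
}


def parse_cuda_version(text):
    """Turn a string such as '9.0.176' or '11.3' into a (major, minor) tuple."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def is_supported(capability, cuda_version):
    """Return True if the CUDA version is new enough for the given capability."""
    return cuda_version >= MIN_CUDA_FOR_CAPABILITY[capability]


# A CUDA 9 build cannot drive either card; a CUDA 11.3 build can drive both.
for name, capability in [("RTX 2080", (7, 5)), ("RTX 3090", (8, 6))]:
    for cuda in ["9.0", "11.3"]:
        ok = is_supported(capability, parse_cuda_version(cuda))
        print(f"{name} with CUDA {cuda}: {'ok' if ok else 'not supported'}")
```

On a real machine the two inputs would come from `torch.cuda.get_device_capability()` and `torch.version.cuda`, which report the GPU's compute capability and the CUDA version the installed PyTorch build was compiled against.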

From what I can find about PyTorch's release history, version 0.4.0 will never be compatible with CUDA 10 or 11, which are the versions the GPUs in my lab need.

Training on the CPU would take more than 50 days, so it is not a realistic option.

I am thinking about updating PyTorch to a version compatible with at least CUDA 10. PyTorch 1.0.0 would be enough.

Do you have any recommendation for doing this in the fastest way possible?

ptrblck · February 25, 2022, 9:59pm

For your 3090 you would need CUDA 11.x, so I would recommend updating to the latest stable or nightly release with CUDA 11.3 or 11.5.
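After upgrading, a quick check along the following lines can confirm that the new build actually sees the GPU and that the layer from the original traceback runs. It is a minimal sketch: the function name `check_install` and the tensor sizes are arbitrary choices for this example, and the GPU part only runs when a CUDA device is visible.

```python
def check_install():
    """Collect basic facts about the installed PyTorch build and the visible GPU."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)
        # Run batch normalization on the GPU, the call that failed under CUDA 9.
        bn = torch.nn.BatchNorm2d(3).cuda()
        out = bn(torch.randn(4, 3, 8, 8, device="cuda"))
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
        info["batchnorm_ok"] = bool(torch.isfinite(out).all())
    return info


print(check_install())
```

If `cuda_available` is false or `built_with_cuda` still reports a 9.x version, the old build is probably still the one being imported.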

teresa · March 2, 2022, 3:11pm

Thank you. I have now updated to CUDA 11 and most of the repo works. Just one piece doesn’t. The full error message is the following:

  File "train_continue.py", line 201, in <module>
    fire.Fire()
  File "/home/…/lib/python3.6/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/…/lib/python3.6/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/home/…/lib/python3.6/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "train_continue.py", line 120, in train
    loss = criterion(score_connect, target)
  File "/home/…/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/…/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 1152, in forward
    label_smoothing=self.label_smoothing)
  File "/home/…/lib/python3.6/site-packages/torch/nn/functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: "nll_loss_forward_reduce_cuda_kernel_2d_index" not implemented for 'Bool'

And the code around line 120 of the script is:
for i, (input, target, existing) in tqdm(enumerate(train_dataloader)):
    input = input.cuda()
    target = target.cuda()
    optimizer.zero_grad()
    score_model = model(input)
    existing = existing.cuda()
    score_model = t.cat([score_model, existing], 1)
    score_connect = connect(score_model)
    loss = criterion(score_connect, target)
    loss.backward()
    optimizer.step()

Do you know why this might happen?

ptrblck · March 3, 2022, 2:05am

nn.CrossEntropyLoss expects logits in the shape of [batch_size, nb_classes, *] as a FloatTensor and targets in the shape [batch_size, *] containing class indices in the range [0, nb_classes-1] as a LongTensor (in the multi-class setup).
In your use case the target is a bool tensor, which is not supported.
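A minimal sketch of the conversion, with made-up tensors standing in for `score_connect` and `target` from the training loop. How the bool target should be converted depends on what it encodes, which the thread does not say, so both cases below are assumptions.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Hypothetical stand-ins for score_connect and target.
batch_size, nb_classes = 4, 2
logits = torch.randn(batch_size, nb_classes)           # float, [batch_size, nb_classes]
bool_target = torch.tensor([True, False, True, True])  # bool, [batch_size]

# Case 1 (assumption): the bool values are binary class labels,
# so False becomes class 0 and True becomes class 1.
target = bool_target.long()
loss = criterion(logits, target)

# Case 2 (assumption): the target is a one-hot bool mask of shape
# [batch_size, nb_classes]; argmax recovers the class index per sample.
one_hot = torch.tensor([[False, True], [True, False], [False, True], [False, True]])
target_from_mask = one_hot.long().argmax(dim=1)
loss_from_mask = criterion(logits, target_from_mask)

print(target.dtype, loss.item(), loss_from_mask.item())
```

In the training loop above, the first case would amount to replacing `target = target.cuda()` with `target = target.cuda().long()`. It is also worth checking why the dataset returns bool targets in the first place, since the same code apparently ran under the older PyTorch version.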