Training stops without any error in loss.backward() while using GPU

timoth-y · December 25, 2019, 6:08pm

I'm trying to train a model on the GPU with CUDA. I have already moved the model and the inputs to the CUDA device:

if use_gpu:
    self.net.cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(self.net.parameters(), lr=1e-3)

inputs, labels = data
if use_gpu:
    inputs = inputs.cuda()
    labels = labels.cuda()

But as soon as it executes loss.backward(), the process crashes without any error message.

outputs = self.net(inputs)
loss = criterion(outputs, labels)
loss.backward()

I am using PyTorch 1.3.1+cu92 with an NVIDIA GeForce RTX 2060.
I have already tried installing the CUDA driver.
Any tips?

ptrblck · December 25, 2019, 8:18pm

Could you try CUDA 10 with your Turing card? Older CUDA versions might not work properly with your GPU.
Also, you should have received a warning regarding this mismatch, but maybe it was hidden by the crash.
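
One way to check for this kind of mismatch before training is sketched below. It is a minimal sketch, not an official PyTorch check: the helper names (`parse_cuda_version`, `build_supports_gpu`, `report`) and the lookup table are hypothetical, and the table only holds the Turing entry (compute capability 7.5, which is what an RTX 2060 reports). The PyTorch part is skipped if PyTorch is not installed or no CUDA device is usable.

```python
# Minimum CUDA version needed for a given GPU compute capability.
# Hypothetical lookup table for illustration; only Turing is listed.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 5): (10, 0),  # Turing, e.g. GeForce RTX 2060
}


def parse_cuda_version(version):
    """Turn a string such as '9.2' into a comparable tuple such as (9, 2)."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def build_supports_gpu(cuda_version, capability):
    """Return False if the CUDA build is known to be too old for this GPU."""
    required = MIN_CUDA_FOR_CAPABILITY.get(tuple(capability))
    if required is None:
        return True  # capability not in the table: nothing to flag
    return parse_cuda_version(cuda_version) >= required


def report():
    """Print the installed PyTorch build and whether it matches the GPU."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
        return
    print("PyTorch:", torch.__version__)
    print("Built with CUDA:", torch.version.cuda)
    if torch.version.cuda is None or not torch.cuda.is_available():
        print("No usable CUDA device")
        return
    capability = torch.cuda.get_device_capability(0)
    print("GPU:", torch.cuda.get_device_name(0), "capability:", capability)
    if not build_supports_gpu(torch.version.cuda, capability):
        print("This CUDA build is too old for the GPU")


if __name__ == "__main__":
    report()
```

With a cu92 build and a Turing card, `build_supports_gpu("9.2", (7, 5))` returns `False`, which would point to the version mismatch before `loss.backward()` is ever reached.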

timoth-y · December 30, 2019, 8:17pm

The problem was a mismatch between the CUDA and PyTorch versions. I had PyTorch 1.3.1+cu92.
However, when I tried to run training with CUDA 9.2 installed, I had a similar problem.
It seems that Turing cards are only supported starting with CUDA 10.x:

CUDA 10 is the first version of CUDA to support the new NVIDIA Turing architecture
CUDA 10 Features Revealed: Turing, CUDA Graphs, and More | NVIDIA Technical Blog

With the latest versions of CUDA and PyTorch, everything works correctly and, most importantly, fast!
Thanks