Distributed training gives nan loss but single GPU training is fine

When I train my network on a single GPU, training completes successfully after 120 epochs. However, if I use two GPUs, I get a NaN loss after a dozen epochs. The only thing I change is the batch size: on a single GPU I use a batch size of 2, and with two GPUs I use a batch size of 1 per GPU. All other parameters are exactly the same. I also replace every BatchNorm2d layer with a SyncBatchNorm layer; strangely, SyncBatchNorm gives a higher loss. What could be the possible reasons?
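
For reference, such a swap can be done with the built-in torch.nn.SyncBatchNorm.convert_sync_batchnorm helper; a minimal sketch with a toy model (not my actual network):

import torch.nn as nn

# Toy model for illustration only; convert_sync_batchnorm walks the module
# tree and replaces every BatchNorm*d layer with a SyncBatchNorm layer.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(inplace=True),
)
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(model)  # the BatchNorm2d layer now appears as SyncBatchNorm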

Could you please paste a code snippet to reproduce? Are you using DataParallel or DistributedDataParallel?

I use DDP. I enabled anomaly detection.
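
Enabling it only takes a single call before the training loop starts; a minimal sketch (the exact placement in my train.py is not shown):

import torch

# Autograd anomaly detection: the backward pass then reports the forward-call
# traceback of the operation that produced the NaN/Inf gradient.
torch.autograd.set_detect_anomaly(True)

Below is the message I get: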

/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:57: UserWarning: Traceback of forward call that caused the error:
File "<string>", line 1, in <module>
File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/usr/lib/python3.6/multiprocessing/spawn.py", line 118, in _main
return self._bootstrap()
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/beinan/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/beinan/Desktop/pytorch-bcn/jupyter/train.py", line 153, in main_worker
train(train_loader, model, criterion, optimizer, epoch, args)
File "/home/beinan/Desktop/pytorch-bcn/jupyter/train.py", line 199, in train
output = model(images)
File "/home/beinan/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/beinan/.local/lib/python3.6/site-packages/apex/parallel/distributed.py", line 560, in forward
result = self.module(*inputs, **kwargs)
File "/home/beinan/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File ".../bcn/models/semantic/resnet34.py", line 295, in forward
x, encoder_features, encoder_feature = self.encoder(x)
File "/home/beinan/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File ".../bcn/models/semantic/resnet34.py", line 224, in forward
x = self.bn(self.conv(x))
File "/home/beinan/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File ".../bcn/layers/conv.py", line 38, in forward
groups=self.groups
File "/home/beinan/.local/lib/python3.6/site-packages/apex/amp/wrap.py", line 28, in wrapper
return orig_fn(*new_args, **kwargs)

Traceback (most recent call last):
File "train.py", line 314, in <module>
main()
File "train.py", line 66, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/home/beinan/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/beinan/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/beinan/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/beinan/Desktop/pytorch-bcn/jupyter/train.py", line 153, in main_worker
train(train_loader, model, criterion, optimizer, epoch, args)
File "/home/beinan/Desktop/pytorch-bcn/jupyter/train.py", line 208, in train
scaled_loss.backward()
File "/home/beinan/.local/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/beinan/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Function 'CudnnConvolutionBackward' returned nan values in its 1th output.

This consistently happens after 90+ epochs, but only when I use DDP; single-GPU, single-node training does not have this problem. By the way, I train with fp16 precision. Is it possible that the combination of fp16 + DDP + SyncBatchNorm somehow leads to this?
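
For context, my fp16 + DDP setup goes through apex (as the traceback shows). Roughly, it looks like the sketch below; the opt_level and the surrounding training-loop details are simplified stand-ins, not my exact code:

from apex import amp
from apex.parallel import DistributedDataParallel

# Sketch only: model, optimizer, criterion and train_loader stand for whatever
# train.py defines; the model is assumed to already contain SyncBatchNorm
# layers, and opt_level="O1" is an assumption, not my exact setting.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model = DistributedDataParallel(model)

for images, target in train_loader:
    output = model(images)
    loss = criterion(output, target)
    optimizer.zero_grad()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()   # this is the line where the NaN surfaces
    optimizer.step()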

Is it possible for the transformation below to cause any problem?

import cv2
import numpy as np
from random import randint  # randint(a, b) is inclusive of both endpoints


class RandomResizeCrop(object):

    def __init__(self, min_scale, max_scale, scale_step, output_size):
        # Candidate scale factors and the target (height, width) of the crop.
        self.scales = np.arange(min_scale, max_scale, scale_step)
        self.output_height, self.output_width = output_size

    def __call__(self, image, annotation):
        # Randomly rescale: bilinear for the image, nearest for the annotation.
        scale = np.random.choice(self.scales)

        image = cv2.resize(image, (0, 0), fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
        annotation = cv2.resize(annotation, (0, 0), fx=scale, fy=scale, interpolation=cv2.INTER_NEAREST)

        input_height, input_width = image.shape[:2]

        # Pad up to the output size if the rescaled image is too small; the image
        # is zero-padded and the annotation is padded with the ignore label 255.
        row_pads = max(self.output_height - input_height, 0)
        col_pads = max(self.output_width - input_width, 0)

        top_pads = randint(0, row_pads)
        bot_pads = row_pads - top_pads

        left_pads = randint(0, col_pads)
        right_pads = col_pads - left_pads

        image = np.pad(image, ((top_pads, bot_pads), (left_pads, right_pads), (0, 0)),
                       mode='constant', constant_values=0)
        annotation = np.pad(annotation, ((top_pads, bot_pads), (left_pads, right_pads)),
                            mode='constant', constant_values=255)

        # Take a random crop of the padded result at the output size.
        y1 = randint(0, max(input_height - self.output_height, 0))
        y2 = y1 + self.output_height

        x1 = randint(0, max(input_width - self.output_width, 0))
        x2 = x1 + self.output_width

        return image[y1:y2, x1:x2], annotation[y1:y2, x1:x2]

It looks like the first convolution in the ResNet is producing the NaN. Is there any way some values of an image could become NaN after this transformation?
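
As far as I can tell, resizing and padding finite pixel values shouldn't introduce NaN on their own, but to rule the data pipeline out I could add a check like this on the transform output (the helper name is just illustrative):

import numpy as np

# Illustrative sanity check to run on the output of RandomResizeCrop,
# e.g. inside the Dataset's __getitem__.
def assert_finite(image, annotation):
    assert np.isfinite(image).all(), "NaN/Inf found in transformed image"
    assert image.shape[:2] == annotation.shape[:2], "image/annotation shape mismatch"
    return image, annotation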

I ran into the exact same problem.
Any chance you eventually found out what the problem was?

Thanks!

@royve In my experience fp16 can lead to NaNs because of its limited precision.

Can you try training without fp16?
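
With apex you can request pure fp32 at initialization instead of mixed precision; a sketch (model and optimizer are assumed to come from your train.py):

from apex import amp

# opt_level="O0" runs everything in fp32; "O1" is the usual mixed-precision mode.
model, optimizer = amp.initialize(model, optimizer, opt_level="O0")

# Alternatively, drop amp entirely and call loss.backward() directly.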

I train with fp32.
I believe my problem was that I used the LBFGS optimizer. When I switched back to Adam, the NaNs were gone. Maybe the learning rate for LBFGS should be substantially decreased, though I didn't try that.
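
For reference, the switch was essentially this (the stand-in model and the learning rates are only illustrative, not my actual values):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in for the real network

# optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0)  # this setup gave me NaNs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)     # with Adam the NaNs were gone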