Training MoCo works on a single GeForce RTX 3090 GPU but fails on a multi-GPU machine due to NaN in the training loop

I am running this code from the MoCo repo with my own image dataset.

On a single-GPU machine it worked fine.
However, when I moved to another machine with the same properties, except that it has multiple GPUs (the same card as in the single-GPU machine), the forward pass produced NaN.

The NaN appears in the first epoch, but not necessarily in its first batches, so debugging takes some time. When I stepped through with a debugger, it looked like the augmentations were producing tensors with a single NaN entry. Yet, when I replayed the exact same random augmentations offline on the same image, the output was fine. Moreover, in some debugging sessions the augmentations didn't produce any NaN but the loss was still NaN, so I'm not sure the augmentations are the issue.
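For reference, the kind of check I mean looks roughly like this (a minimal sketch, not the exact code from my runs; it assumes the MoCo-v2 loader yields a list of two augmented crops per sample as in main_moco.py, and assert_finite is a helper name I made up):

```python
import torch

def assert_finite(t, name):
    # Raise as soon as a tensor contains NaN or Inf, reporting where it is.
    if not torch.isfinite(t).all():
        bad = (~torch.isfinite(t)).nonzero()
        raise RuntimeError(f"{name}: {bad.shape[0]} non-finite entries, first at {bad[0].tolist()}")

# train_loader is the DataLoader built in main_moco.py
for i, (images, _) in enumerate(train_loader):
    q_view = images[0].cuda(non_blocking=True)  # query crop
    k_view = images[1].cuda(non_blocking=True)  # key crop
    assert_finite(q_view, f"batch {i} query view")
    assert_finite(k_view, f"batch {i} key view")
    # ... forward pass, loss and backward continue as in the original train() ...
```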

I use torch.autograd.set_detect_anomaly, and the NaN output consistently appears in the first backward step, which implies this is not a case of exploding gradients.
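For completeness, this is roughly how I enable it (a minimal sketch; placing it inside each spawned worker is my assumption, since the flag is per-process and start_method='spawn' does not inherit it from the parent):

```python
import torch

# Enable once per worker process, e.g. at the top of main_worker():
torch.autograd.set_detect_anomaly(True)  # backward raises at the first op returning NaN

# Or scope it to a single iteration to limit the (significant) overhead:
# with torch.autograd.detect_anomaly():
#     loss = criterion(output, target)
#     loss.backward()
```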

I also tried running the same code inside this PyTorch Docker container on the multi-GPU host, but the same problem occurred.

Error

.
.
.
Epoch: [0][ 1220/30451]	Time 13.149 ( 4.412)	Data  9.291 ( 3.442)	Loss 9.4586e+00 (1.0130e+01)	Acc@1  15.62 ( 11.84)	Acc@5  31.25 ( 25.07)
Epoch: [0][ 1230/30451]	Time  0.437 ( 4.402)	Data  0.000 ( 3.431)	Loss 9.3715e+00 (1.0124e+01)	Acc@1  21.09 ( 11.90)	Acc@5  39.84 ( 25.18)
Epoch: [0][ 1240/30451]	Time 15.713 ( 4.406)	Data 11.483 ( 3.431)	Loss 9.3461e+00 (1.0118e+01)	Acc@1  14.06 ( 11.95)	Acc@5  34.38 ( 25.26)
Epoch: [0][ 1250/30451]	Time  0.443 ( 4.397)	Data  0.000 ( 3.423)	Loss 9.2996e+00 (1.0112e+01)	Acc@1  21.09 ( 12.00)	Acc@5  40.62 ( 25.34)
Epoch: [0][ 1260/30451]	Time 12.634 ( 4.396)	Data 11.158 ( 3.422)	Loss 9.2990e+00 (1.0105e+01)	Acc@1  19.53 ( 12.06)	Acc@5  42.97 ( 25.45)
/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Error detected in LogSoftmaxBackward0. Traceback of forward call that caused the error:
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 129, in _main
    return self._bootstrap(parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/my-username/Projects/moco-v2/main_moco.py", line 290, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "/home/my-username/Projects/moco-v2/main_moco.py", line 328, in train
    loss = criterion(output, target)
  File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/nn/functional.py", line 3026, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
  File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/fx/traceback.py", line 57, in format_stack
    return traceback.format_stack()
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "main_moco.py", line 432, in <module>
    main()
  File "main_moco.py", line 139, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/my-username/Projects/moco-v2/main_moco.py", line 290, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "/home/my-username/Projects/moco-v2/main_moco.py", line 341, in train
    loss.backward()
  File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.

Single-GPU environment

(venv) user@name:~/Projects/moco-v2$ python -V
Python 3.8.10
(venv) user@name:~/Projects/moco-v2$ pip freeze
certifi==2022.12.7
charset-normalizer==2.1.1
click==8.1.3
docker-pycreds==0.4.0
gitdb==4.0.10
GitPython==3.1.29
idna==3.4
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
pathtools==0.1.2
Pillow==9.3.0
promise==2.3
protobuf==4.21.11
psutil==5.9.4
PyYAML==6.0
requests==2.28.1
sentry-sdk==1.11.1
setproctitle==1.3.2
shortuuid==1.0.11
six==1.16.0
smmap==5.0.0
torch==1.13.0
torchvision==0.14.0
tqdm==4.64.1
typing-extensions==4.4.0
urllib3==1.26.13
wandb==0.13.6
(venv) user@name:~/Projects/moco-v2$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda
'11.7'
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_properties()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: get_device_properties() missing 1 required positional argument: 'device'
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24259MB, multi_processor_count=82)
>>> 

Multi-GPU environment

(venv) user@name:~/Projects/moco-v2$ python -V
Python 3.8.10
(venv) user@name:~/Projects/moco-v2$ pip freeze
certifi==2022.12.7
charset-normalizer==2.1.1
click==8.1.3
docker-pycreds==0.4.0
gitdb==4.0.10
GitPython==3.1.29
idna==3.4
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
pathtools==0.1.2
Pillow==9.3.0
pkg_resources==0.0.0
promise==2.3
protobuf==4.21.11
psutil==5.9.4
PyYAML==6.0
requests==2.28.1
sentry-sdk==1.11.1
setproctitle==1.3.2
shortuuid==1.0.11
six==1.16.0
smmap==5.0.0
torch==1.13.0
torchvision==0.14.0
tqdm==4.64.1
typing_extensions==4.4.0
urllib3==1.26.13
wandb==0.13.6
(venv) user@name:~/Projects/moco-v2$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda
'11.7'
>>> torch.cuda.device_count
<functools._lru_cache_wrapper object at 0x7ff2afb44040>
>>> torch.cuda.device_count()
2
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24259MB, multi_processor_count=82)
>>> torch.cuda.get_device_properties(1)
_CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24258MB, multi_processor_count=82)
>>> 

It’s a bit tricky to give valid debugging advice, as the issue you are seeing is not deterministic and seems to show up randomly.
Could you rerun your workload with compute-sanitizer (with memcheck and racecheck) and see if it raises any errors? This would be the first step to isolate the issue further in case a kernel is misbehaving.


Thanks @ptrblck for replying,

I tried both memcheck and racecheck, but in both runs I got this message at the top of the output:

========= Error: No attachable process found. compute-sanitizer timed-out.
========= Default timeout can be adjusted with --launch-timeout. Awaiting target completion.

Then the training starts, and after some time it stops in the same way I described in the post above.

When I set --launch-timeout 600 the message is the same, and with --launch-timeout 0 (infinite) the training doesn’t even start.

After long debugging sessions we discovered a physical hardware defect.
Fixing it solved the problem.