I am running this code from the MoCo repo with my own image dataset.
On a single GPU machine it worked fine.
However, when I moved to another machine with the same properties, except that it has multiple GPUs (the same card model as in the single-GPU machine), the forward pass produced NaN.
The NaN appears in the first epoch, but not necessarily in its first batches, so debugging takes some time. When I used a debugger, it looked like the augmentations were outputting tensors with a single NaN entry. Yet when I replayed the exact same random augmentations offline on the same image, everything worked fine. Moreover, in some debugging sessions the augmentations did not produce NaN but the loss was still NaN, so I'm not sure the augmentations are the issue.
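As a sketch of how I checked for NaNs coming out of the augmentations (the helper and variable names are illustrative, not the exact code in main_moco.py):

import torch

def find_nan_augmentations(train_loader):
    # `train_loader` is assumed to be the MoCo DataLoader, which yields
    # (images, target) where `images` is a list of two augmented views.
    for batch_idx, (images, _) in enumerate(train_loader):
        for view_idx, view in enumerate(images):
            nan_mask = torch.isnan(view)
            if nan_mask.any():
                # Report which samples in the batch contain at least one NaN pixel.
                bad = nan_mask.flatten(1).any(dim=1).nonzero().flatten().tolist()
                print(f"batch {batch_idx}, view {view_idx}: NaN in samples {bad}")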
I use torch.autograd.set_detect_anomaly(True), and the NaN consistently shows up in the very first backward operation (LogSoftmaxBackward0), which suggests the problem is not exploding gradients.
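For completeness, anomaly detection is enabled at the start of training roughly like this (a sketch of the relevant call, not the exact lines from the repo):

import torch

# Make the backward pass raise a RuntimeError as soon as a NaN is produced,
# together with a traceback of the forward op that created it.
torch.autograd.set_detect_anomaly(True)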
I also tried running the same code inside this PyTorch Docker container on the multi-GPU host, but the same problem occurred.
Error
...
Epoch: [0][ 1220/30451] Time 13.149 ( 4.412) Data 9.291 ( 3.442) Loss 9.4586e+00 (1.0130e+01) Acc@1 15.62 ( 11.84) Acc@5 31.25 ( 25.07)
Epoch: [0][ 1230/30451] Time 0.437 ( 4.402) Data 0.000 ( 3.431) Loss 9.3715e+00 (1.0124e+01) Acc@1 21.09 ( 11.90) Acc@5 39.84 ( 25.18)
Epoch: [0][ 1240/30451] Time 15.713 ( 4.406) Data 11.483 ( 3.431) Loss 9.3461e+00 (1.0118e+01) Acc@1 14.06 ( 11.95) Acc@5 34.38 ( 25.26)
Epoch: [0][ 1250/30451] Time 0.443 ( 4.397) Data 0.000 ( 3.423) Loss 9.2996e+00 (1.0112e+01) Acc@1 21.09 ( 12.00) Acc@5 40.62 ( 25.34)
Epoch: [0][ 1260/30451] Time 12.634 ( 4.396) Data 11.158 ( 3.422) Loss 9.2990e+00 (1.0105e+01) Acc@1 19.53 ( 12.06) Acc@5 42.97 ( 25.45)
/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Error detected in LogSoftmaxBackward0. Traceback of forward call that caused the error:
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 129, in _main
return self._bootstrap(parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/my-username/Projects/moco-v2/main_moco.py", line 290, in main_worker
train(train_loader, model, criterion, optimizer, epoch, args)
File "/home/my-username/Projects/moco-v2/main_moco.py", line 328, in train
loss = criterion(output, target)
File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/nn/functional.py", line 3026, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/fx/traceback.py", line 57, in format_stack
return traceback.format_stack()
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "main_moco.py", line 432, in <module>
main()
File "main_moco.py", line 139, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/my-username/Projects/moco-v2/main_moco.py", line 290, in main_worker
train(train_loader, model, criterion, optimizer, epoch, args)
File "/home/my-username/Projects/moco-v2/main_moco.py", line 341, in train
loss.backward()
File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/my-username/Projects/moco-v2/venv/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
Single GPU environment
(venv) user@name:~/Projects/moco-v2$ python -V
Python 3.8.10
(venv) user@name:~/Projects/moco-v2$ pip freeze
certifi==2022.12.7
charset-normalizer==2.1.1
click==8.1.3
docker-pycreds==0.4.0
gitdb==4.0.10
GitPython==3.1.29
idna==3.4
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
pathtools==0.1.2
Pillow==9.3.0
promise==2.3
protobuf==4.21.11
psutil==5.9.4
PyYAML==6.0
requests==2.28.1
sentry-sdk==1.11.1
setproctitle==1.3.2
shortuuid==1.0.11
six==1.16.0
smmap==5.0.0
torch==1.13.0
torchvision==0.14.0
tqdm==4.64.1
typing-extensions==4.4.0
urllib3==1.26.13
wandb==0.13.6
(venv) user@name:~/Projects/moco-v2$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda
'11.7'
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_properties()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: get_device_properties() missing 1 required positional argument: 'device'
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24259MB, multi_processor_count=82)
>>>
Multi GPU environment
(venv) user@name:~/Projects/moco-v2$ python -V
Python 3.8.10
(venv) user@name:~/Projects/moco-v2$ pip freeze
certifi==2022.12.7
charset-normalizer==2.1.1
click==8.1.3
docker-pycreds==0.4.0
gitdb==4.0.10
GitPython==3.1.29
idna==3.4
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
pathtools==0.1.2
Pillow==9.3.0
pkg_resources==0.0.0
promise==2.3
protobuf==4.21.11
psutil==5.9.4
PyYAML==6.0
requests==2.28.1
sentry-sdk==1.11.1
setproctitle==1.3.2
shortuuid==1.0.11
six==1.16.0
smmap==5.0.0
torch==1.13.0
torchvision==0.14.0
tqdm==4.64.1
typing_extensions==4.4.0
urllib3==1.26.13
wandb==0.13.6
(venv) user@name:~/Projects/moco-v2$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda
'11.7'
>>> torch.cuda.device_count
<functools._lru_cache_wrapper object at 0x7ff2afb44040>
>>> torch.cuda.device_count()
2
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24259MB, multi_processor_count=82)
>>> torch.cuda.get_device_properties(1)
_CudaDeviceProperties(name='NVIDIA GeForce RTX 3090', major=8, minor=6, total_memory=24258MB, multi_processor_count=82)
>>>