Hello,
I’m struggling to run training on a single node with multiple GPUs. The host is a DGX-A100, and the A100 GPUs have been partitioned with MIG; I allocated two MIG instances for my experiment.
I took the code from the PyTorch examples: https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series
But it fails to run:
fix_jer@dgxa100:~/GIT/examples/distributed/ddp-tutorial-series$ torchrun --standalone --nnodes=1 --nproc-per-node=2 multigpu_torchrun.py 50 10
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 260, in _lazy_init
queued_call()
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 145, in _check_capability
capability = get_device_capability(d)
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 381, in get_device_capability
prop = get_device_properties(device)
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 399, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "multigpu_torchrun.py", line 111, in <module>
main(args.save_every, args.total_epochs, args.batch_size)
File "multigpu_torchrun.py", line 95, in main
ddp_setup()
File "multigpu_torchrun.py", line 15, in ddp_setup
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 350, in set_device
torch._C._cuda_setDevice(device)
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 264, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.
CUDA call was originally invoked at:
[...]
Reading here and there, I saw it might be linked to the backend, so I tried both nccl and gloo, with no success; a sketch of what I changed is below.
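For context, here is roughly what ddp_setup() in multigpu_torchrun.py looks like, paraphrased from the tutorial; the backend parameter is my own addition for switching between nccl and gloo (the stock file hard-codes it):

import os
import torch
from torch.distributed import init_process_group

def ddp_setup(backend: str = "nccl"):
    # torchrun sets LOCAL_RANK for each worker; with --nproc-per-node=2 it is 0 or 1
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # the call that fails in the traceback above
    init_process_group(backend=backend)  # I switched between "nccl" and "gloo" here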
Actually, even trying to collect information about the environment fails:
fix_jer@dgxa100:~/GIT/examples/distributed/ddp-tutorial-series$ python3 -m torch.utils.collect_env
Collecting environment information...
Traceback (most recent call last):
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 260, in _lazy_init
queued_call()
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 145, in _check_capability
capability = get_device_capability(d)
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 381, in get_device_capability
prop = get_device_properties(device)
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 399, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/utils/collect_env.py", line 602, in <module>
main()
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/utils/collect_env.py", line 585, in main
output = get_pretty_env_info()
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/utils/collect_env.py", line 580, in get_pretty_env_info
return pretty_str(get_env_info())
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/utils/collect_env.py", line 451, in get_env_info
cuda_module_loading=get_cuda_module_loading_config(),
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/utils/collect_env.py", line 406, in get_cuda_module_loading_config
torch.cuda.init()
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 216, in init
_lazy_init()
File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 264, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch.
CUDA call was originally invoked at:
[' File "/usr/lib/python3.8/runpy.py", line 185, in _run_module_as_main\n mod_name, mod_spec, code = _get_module_details(mod_name, _Error)\n', ' File "/usr/lib/python3.8/runpy.py", line 111, in _get_module_details\n __import__(pkg_name)\n', ' File "<frozen importlib._bootstrap>", line 991, in _find_and_load\n', ' File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked\n', ' File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed\n', ' File "<frozen importlib._bootstrap>", line 991, in _find_and_load\n', ' File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked\n', ' File "<frozen importlib._bootstrap>", line 671, in _load_unlocked\n', ' File "<frozen importlib._bootstrap_external>", line 848, in exec_module\n', ' File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed\n', ' File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/__init__.py", line 1146, in <module>\n _C._initExtension(manager_path())\n', ' File "<frozen importlib._bootstrap>", line 991, in _find_and_load\n', ' File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked\n', ' File "<frozen importlib._bootstrap>", line 671, in _load_unlocked\n', ' File "<frozen importlib._bootstrap_external>", line 848, in exec_module\n', ' File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed\n', ' File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 197, in <module>\n _lazy_call(_check_capability)\n', ' File "/raid/home/fix_jer/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 195, in _lazy_call\n _queued_calls.append((callable, traceback.format_stack()))\n']
For the environment, here is some information:
fix_jer@dgxa100:~/GIT/examples/distributed/ddp-tutorial-series$ python3 -c "import torch; print(torch.cuda.is_available())"
True
fix_jer@dgxa100:~/GIT/examples/distributed/ddp-tutorial-series$ python3 -c "import torch; print(torch.cuda.device_count())"
2
For CUDA_VISIBLE_DEVICES:
fix_jer@dgxa100:~/GIT/examples/distributed/ddp-tutorial-series$ echo $CUDA_VISIBLE_DEVICES
0,1
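To try to reproduce the problem outside of torchrun, here is a minimal probe (my own snippet, not from the tutorial) that exercises the same get_device_properties call the tracebacks end in:

import torch

# is_available() and device_count() look fine here, but querying the
# properties of each visible device is where the INTERNAL ASSERT seems to originate
print("available:", torch.cuda.is_available())
print("device_count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_properties(i))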
and the nvidia-smi output:
fix_jer@dgxa100:~/GIT/examples/distributed/ddp-tutorial-series$ nvidia-smi
Thu Jul 20 16:43:59 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:01:00.0 Off | On |
| N/A 49C P0 57W / 275W | 45MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... Off | 00000000:47:00.0 Off | On |
| N/A 49C P0 66W / 275W | 45MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... Off | 00000000:81:00.0 Off | On |
| N/A 49C P0 58W / 275W | 45MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA DGX Display Off | 00000000:C1:00.0 Off | N/A |
| 34% 45C P8 N/A / 50W | 1MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... Off | 00000000:C2:00.0 Off | On |
| N/A 48C P0 56W / 275W | 48MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 7 0 0 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 8 0 1 | 6MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
and the libraries:
fix_jer@dgxa100:~/GIT/examples/distributed/ddp-tutorial-series$ python3 -m pip list | grep torch
pytorch-lightning 1.9.5
torch 2.0.1
torchmetrics 1.0.1
torchvision 0.15.2
fix_jer@dgxa100:~/GIT/examples/distributed/ddp-tutorial-series$ python3 --version
Python 3.8.10
Do you think this might be related to the way we configured MIG, or does anything come to mind that I’m not doing the right way?
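One thing I am unsure about on the MIG side is whether each worker should be pointed at its MIG instance by UUID rather than by index 0/1, i.e. something along these lines (the UUIDs below are placeholders taken from nvidia-smi -L):

# list GPUs and MIG instances with their UUIDs
nvidia-smi -L
# ... MIG 1g.10gb Device 0: (UUID: MIG-xxxxxxxx-...)   <- placeholder
# ... MIG 1g.10gb Device 1: (UUID: MIG-yyyyyyyy-...)   <- placeholder

# expose the two MIG instances explicitly instead of indices 0,1
export CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-...,MIG-yyyyyyyy-...
torchrun --standalone --nnodes=1 --nproc-per-node=2 multigpu_torchrun.py 50 10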
Thanks for your help
Edit: I forgot to mention that the code works with 1 process and 1 MIG:
fix_jer@dgxa100:~$ srun --partition=interactive10 --gres=gpu:1g.10gb:1 --ntasks=1 --cpus-per-task=4 --pty bash
fix_jer@dgxa100:~/GIT/examples/distributed/ddp-tutorial-series$ torchrun --standalone --nnodes=1 --nproc-per-node=1 multigpu_torchrun.py 50 10
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[GPU0] Epoch 0 | Batchsize: 32 | Steps: 64
Epoch 0 | Training snapshot saved at snapshot.pt
[GPU0] Epoch 1 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 2 | Batchsize: 32 | Steps: 64
[GPU0] Epoch 3 | Batchsize: 32 | Steps: 64