Hi,
I’m trying to train an InternImage model and I’m running into a weird issue with either torchrun or torch.distributed.launch.
I’m using the NVIDIA PyTorch Docker image 22.04, which comes with CUDA 11.6.2 and PyTorch 1.12.
When I train the model with "python train.py …" it runs fine; however, when I try to take advantage of my 2 GPUs, I run into problems.
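For reference, these are roughly the commands I’m running (the config path is a placeholder and I’m quoting the arguments from memory, so treat them as approximate):

# 1 GPU (works fine)
python train.py configs/my_config.py

# 2 GPUs with torchrun (this is the run that crashes)
torchrun --nproc_per_node=2 train.py configs/my_config.py --launcher pytorch

# same behavior with the older launcher
python -m torch.distributed.launch --nproc_per_node=2 train.py configs/my_config.py --launcher pytorch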
I can see that both GPUs are working, since their temperatures and memory usage go up, but while training on 1 GPU takes about 2.5 days, the ETA with 2 GPUs is over 3.5 days.
The time isn’t my main concern, though; the real problem is that training crashes after 1000 iterations, right when the first checkpoint is being saved. None of these issues occurs when I use a single GPU.
I also checked the system status at the moment of the crash, and I don’t see any RAM or GPU memory exhaustion.
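For what it’s worth, by “checked the system status” I just mean I was watching memory from a second terminal with something like this, so it’s not a precise measurement:

watch -n 2 free -m       # host RAM as seen inside the container
watch -n 2 nvidia-smi    # GPU memory usage and utilization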
I’d appreciate your help troubleshooting this problem.
My setup: WSL2, the NVIDIA 22.04 Docker image, and two RTX 3090 GPUs.
This is the error I’m getting:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 19682 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 19681) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.0a0+bd13bc6', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-04-06_21:18:35
host : f56b97aeeef1
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 19681)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 19681
root@f56b97aeeef1:/workspace/InternImage/segmentation# /opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 38 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 38 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '