Training runs 50% slower when using 2 GPUs compared to 1

Hi,

I’m trying to train an InternImage model and I’m running into a weird issue when launching with either torchrun or torch.distributed.launch.

I’m using the NVIDIA PyTorch 22.04 Docker container with CUDA 11.6.2 and PyTorch 1.12.
When I train the model with “python train.py …” it runs well; however, when I try to take advantage of my 2 GPUs, I’m running into problems.

I can see that both GPUs are running, since their temperature and memory usage go up, but while training the model with 1 GPU takes 2.5 days, with 2 GPUs the ETA is > 3.5 days.

I don’t care that much about the training time; the main problem is that the training crashes after 1000 iterations, when it’s saving the first checkpoint. Neither issue occurs when I use 1 GPU.

I also checked the system status when the crash happens; there’s no RAM or GPU memory usage issue.

I’d appreciate your help troubleshooting this problem.

I’m using WSL2 with the NVIDIA 22.04 Docker container, and I have two RTX 3090 GPUs.

This is the error I’m getting:

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 19682 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 19681) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.12.0a0+bd13bc6', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-04-06_21:18:35
host : f56b97aeeef1
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 19681)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 19681

root@f56b97aeeef1:/workspace/InternImage/segmentation# /opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 38 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 38 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

I don’t see a clear error being raised besides the SIGKILL, which could generally indicate that you might be running out of host RAM.
Are you seeing the same issues using a newer container or the current 2.0.0 binaries?
Also, you could try to rerun the code with more debugging flags, e.g.:

NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=INFO TORCH_SHOW_CPP_STACKTRACES=1 torchrun ...

to get more information about the failure.

Thanks for your response!

I watched the Performance tab in Task Manager (Windows) and didn’t observe any memory spike; usage stays constantly around 60%. I can give a newer version a try, but the model I’m trying to run has binaries adapted up to CUDA 11.6 and Torch 1.12. (That doesn’t mean it wouldn’t work with newer versions, of course.)

The current 2.0.0 binaries gave me a hard time with converting models to ONNX, so I’m waiting for ONNX to catch up in order to go back to using them.

I ran the code with all the flags mentioned, and here’s the output I get now:

2023-04-09 10:40:57,962 - mmseg - INFO - Iter [950/160000] lr: 1.059e-06, eta: 3 days, 8:30:44, time: 1.835, data_time: 0.059, memory: 17117, decode.loss_ce: 0.0644, decode.acc_seg: 99.1115, aux.loss_ce: 0.0824, aux.acc_seg: 98.4049, loss: 0.1468, grad_norm: 1.3862
2023-04-09 10:42:29,634 - mmseg - INFO - Saving checkpoint at 1000 iterations
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 12289 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 12288) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.12.0a0+bd13bc6', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-04-09_10:43:14
host : f56b97aeeef1
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 12288)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 12288

/opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 38 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
root@f56b97aeeef1:/workspace/InternImage/segmentation# /opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 38 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

It also seems to be stuck: it didn’t return me to the command line, but rather hangs on the “warnings.warn…” line.

Is there anything else I should try before moving to a more recent container? And if a newer container doesn’t fix the problem, what else could cause this?

I would try to narrow down why the SIGKILL is sent to the process, as it still seems as if you are running out of host RAM.
dmesg can often give you more information and might show which process killed your training.
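For example, a quick way to check whether the kernel’s OOM killer terminated the training process (assuming dmesg is readable from inside your WSL2 shell) would be:

dmesg -T | grep -i -E "out of memory|oom|killed process"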

OK, so I agree it’s a memory issue.
I took the same model, and its “base” version didn’t crash when training with 2 GPUs. The problem is with the heavier “XL” version.

I also dug a bit deeper. It’s not related to the checkpoint save, since the checkpoint is written correctly; it crashes right as the model is about to start the validation evaluation step.

Is there anything that can be done to fix it?
This is the error that keeps repeating in dmesg:

misc dxg: dxgk: dxgkio_reserve_gpu_va: Ioctl failed: -75

Is there any way to reduce the memory load while running evaluation?

The error message seems to be raised by WSL2, and I guess WSL2 itself might be running into trouble once you run out of host RAM.
You could try to delete unneeded objects before starting the evaluation run, or generally wrap the training and evaluation into functions, since Python will free objects when a function exits if they are no longer referenced.
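As a rough sketch of that idea (all names here, such as run_validation, train_step, criterion, and the data loaders, are placeholders and not taken from the InternImage/mmseg code):

import gc

import torch


def run_validation(model, val_loader):
    # Everything allocated inside this function can be freed once it
    # returns, as long as nothing outside keeps a reference to it.
    model.eval()
    with torch.no_grad():
        for images, targets in val_loader:
            outputs = model(images)
            # ... accumulate metrics here ...
    model.train()


def train_step(model, optimizer, criterion, images, targets):
    outputs = model(images)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    # Return a plain float so no reference to the autograd graph is kept alive.
    return loss.item()


def train_loop(model, optimizer, criterion, train_loader, val_loader, eval_interval=1000):
    for iteration, (images, targets) in enumerate(train_loader, start=1):
        train_step(model, optimizer, criterion, images, targets)
        if iteration % eval_interval == 0:
            # Collect unreferenced Python objects (host RAM) and release
            # cached GPU memory before the memory-heavy evaluation pass.
            gc.collect()
            torch.cuda.empty_cache()
            run_validation(model, val_loader)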