Hi, I’m working on 16x nodes, each with 8x A100 GPUs, and when using DDP there are spikes in host RAM usage that reach the 1.1 TiB limit available on p4d instances.
- This arises on the host node (graph above), but I can’t confirm whether it arises on the other nodes as well. Presumably it doesn’t, but if the host node goes down, the entire training does too.
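To check whether the spikes also happen on the other nodes, one option is to log each node's host RAM from within the training loop. A minimal sketch using `psutil` (a third-party package, and the `RANK` environment variable that torchrun/SLURM launchers typically set; both are assumptions, since the launch setup isn't shown):

```python
import os
import psutil  # third-party: pip install psutil


def log_host_memory(tag=""):
    """Print this node's host RAM usage; call periodically (e.g. every N steps)."""
    vm = psutil.virtual_memory()
    used_gib = vm.used / 2**30
    total_gib = vm.total / 2**30
    rank = os.environ.get("RANK", "?")  # set by torchrun; "?" if launched differently
    print(f"[rank {rank}] {tag} host RAM: {used_gib:.1f}/{total_gib:.1f} GiB "
          f"({vm.percent:.0f}%)")


log_host_memory("after step")
```

Calling this from, say, local rank 0 on every node would show whether the growth is head-node-only (which would point at logging/checkpointing/aggregation code) or uniform across nodes (which would point at the per-rank data pipeline).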
============================================================
scripts/ddp_convnext.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2022-08-26_22:41:31
host : gpu-st-p4d-24xlarge-56.hpc-1click-prod450.pcluster
rank : 77 (local_rank: 5)
exitcode : -6 (pid: 28176)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 28176
------------------------------------------------------------
.......
.... [Multiple other nodes' stdout] ...
.......
srun: error: gpu-st-p4d-24xlarge-57: task 13: Exited with exit code 1
srun: error: gpu-st-p4d-24xlarge-46: task 2: Exited with exit code 1
slurmstepd: error: Detected 1142 oom-kill event(s) in StepId=4236.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: gpu-st-p4d-24xlarge-51: task 7: Exited with exit code 1
srun: error: gpu-st-p4d-24xlarge-55: task 11: Out Of Memory
Code: Main script
How can we reduce the memory footprint when doing multi-node DDP? I feel 1.1 TiB per node should be enough, but that doesn’t seem to be the case.
Interestingly, this problem arises only when I scale up the number of processes/GPUs; below a certain limit, things work well enough. I also had to disable some GPUs to save memory, so I can now only run 8x nodes with 6x A100s each instead of the 8x A100s available per node. Increasing the number of processes also makes training slower. I’m unsure where the problem is; some guidance would be very welcome.
EDIT: Here’s also a truncated SLURM log: Gist. It’s quite a few thousand lines, so it’s the raw link. I’ve only removed the argparse output, some echos, and things like model summaries.