Reducing RAM usage with DDP

Hi, I’m working on 16 nodes, each with 8x A100 GPUs, and when using DDP there appear to be spikes in host RAM usage that reach the 1.1 TB limit of the p4d instances.

[Image: RAM usage graph for the host node]

  • This arises on the host node (above graph), but I can’t confirm whether it happens on the other nodes as well. Presumably it doesn’t, but if the host node goes down, the entire training does too.
============================================================
scripts/ddp_convnext.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-08-26_22:41:31
  host      : gpu-st-p4d-24xlarge-56.hpc-1click-prod450.pcluster
  rank      : 77 (local_rank: 5)
  exitcode  : -6 (pid: 28176)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 28176
------------------------------------------------------------

.......
.... [Multiple other nodes' stdout] ...
.......
srun: error: gpu-st-p4d-24xlarge-57: task 13: Exited with exit code 1
srun: error: gpu-st-p4d-24xlarge-46: task 2: Exited with exit code 1
slurmstepd: error: Detected 1142 oom-kill event(s) in StepId=4236.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: gpu-st-p4d-24xlarge-51: task 7: Exited with exit code 1
srun: error: gpu-st-p4d-24xlarge-55: task 11: Out Of Memory


Code: Main script

How can we reduce the memory footprint when doing multi-node DDP? I feel 1.1 TiB per node should be enough - but that doesn’t seem to be the case.

Interestingly, this problem only arises when I scale up the number of processes/GPUs - below a certain limit, things work well enough. I also had to disable some GPUs to save memory, so now I can only run 8 nodes with 6x A100 each instead of the 8x A100s available per node. Increasing the number of processes also makes training slower. I’m unsure where the problem is here - some guidance would be very, very welcome :slight_smile:

EDIT: Here’s also a truncated SLURM log: Gist - it’s quite a few thousand lines, so it’s the raw link. I’ve only removed argparse, some echoes, and things like model summaries.

Did you try to profile your code and check where the spikes might be coming from?
Based on the posted code snippet it seems you are pre-loading the dataset:

self.dataset = tfds.as_numpy(self.ds)

and might have a large batch size:

return 45491349 // args.batch_size

which might increase the memory usage significantly depending on the number of workers in the DataLoader in each DDP process.
Also, it seems you are mixing TensorFlow with PyTorch and I don’t know how TF would behave in this setup.
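
For reference, this is roughly the wrapper I’m assuming from the snippet - apart from the two lines marked “posted”, the class and attribute names are my guesses, not your actual code:

from types import SimpleNamespace

import tensorflow_datasets as tfds
from torch.utils.data import IterableDataset

args = SimpleNamespace(batch_size=48)  # stand-in for the argparse namespace


class TFDSWrapper(IterableDataset):
    # hypothetical reconstruction - only the lines marked "posted" come from the thread
    def __init__(self, ds):
        self.ds = ds                              # a tf.data.Dataset
        self.dataset = tfds.as_numpy(self.ds)     # posted: converts to an iterator of NumPy arrays

    def __iter__(self):
        # every DDP process (and every DataLoader worker) walks the full stream
        yield from self.dataset

    def __len__(self):
        return 45491349 // args.batch_size        # posted: number of samples // batch size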

I wasn’t using a DataLoader - I profiled without it: rentry.co link

The usage seems nominal, and not very high relative to the terabyte of RAM available.

True, but I am converting it to a standard NumPy iterator and wrapping it in an IterableDataset, so it should behave like any standard PyTorch Dataset - no?

The batch size is 48 :thinking:

Yes, this would behave like the “standard” eager Dataset and would pre-load the large dataset onto the host in each DDP process (and in each worker if multiprocessing is used).
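
One common mitigation is to shard the stream so that each rank (and each DataLoader worker) only keeps its own slice instead of the full dataset. A minimal sketch with assumed names, not tied to your script:

import torch.distributed as dist
from torch.utils.data import IterableDataset, get_worker_info


class ShardedStream(IterableDataset):
    # sketch: shard an iterable source across DDP ranks and DataLoader workers
    def __init__(self, make_iterator):
        self.make_iterator = make_iterator  # callable returning a fresh iterator over the samples

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1

        worker = get_worker_info()
        worker_id = worker.id if worker is not None else 0
        num_workers = worker.num_workers if worker is not None else 1

        shard_id = rank * num_workers + worker_id
        num_shards = world_size * num_workers

        # round-robin: only yield every num_shards-th sample in this process/worker
        for i, sample in enumerate(self.make_iterator()):
            if i % num_shards == shard_id:
                yield sample

Note that this still decodes the full stream in every process; with tf.data it is usually cheaper to shard the underlying dataset itself, e.g. via ds.shard(num_shards, shard_id), before any heavy decoding.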

Ah, sorry. I was staring at other C++ code and saw the // args.batch_size as a comment, not an integer division.

:frowning: That’s unfortunate - it’s supposed to be an iterator because the dataset is humongous (~20 TB). Would you happen to have any general direction I can take to debug this? Would reducing the number of workers for the tf.dataset itself help?
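
Something along these lines is what I have in mind - just a sketch, not my actual pipeline:

import tensorflow as tf


def limit_tf_pipeline_memory(ds: tf.data.Dataset) -> tf.data.Dataset:
    # cap the per-dataset thread pool and internal parallelism
    options = tf.data.Options()
    options.threading.private_threadpool_size = 2
    options.threading.max_intra_op_parallelism = 1
    ds = ds.with_options(options)
    # keep only one batch in flight instead of a large/AUTOTUNE prefetch buffer
    return ds.prefetch(1)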

You could try to use np.memmap to avoid loading the data. However, given the size of the dataset (20 TB) and the RAM of your system (1 TB), I don’t think you are actually pre-loading the data, and you would need to explain what

self.dataset = tfds.as_numpy(self.ds)

is.
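
As for np.memmap: it lets you slice a file on disk without reading it all into RAM. A generic sketch with an illustrative file name, shape, and dtype, unrelated to your TFDS pipeline:

import numpy as np

# open an existing binary file as a read-only memory-mapped array;
# the shape and dtype are illustrative and must match how the file was written
data = np.memmap("features.bin", dtype=np.float32, mode="r",
                 shape=(45491349, 1024))

batch = np.asarray(data[:48])  # only this slice is actually read from disk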


Pretty much this:

Converting it to a generator of NumPy arrays

EDIT: It appears that del-ing a few variables helps reduce memory usage. I can now use 6x A100s on 16 nodes, but still can’t use all 8 GPUs per node due to memory errors. Any ideas, @ptrblck?
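
For anyone curious, the change is roughly the pattern below - loader and train_step are placeholders, not my actual objects:

import gc


def train_epoch(loader, train_step, collect_every=1000):
    # loader and train_step stand in for the real DataLoader/iterator and step function
    for step, batch in enumerate(loader):
        images, labels = batch
        train_step(images, labels)

        # drop host references to the batch right after the step so the
        # NumPy copies can be freed before the next batch is materialized
        del batch, images, labels
        if step % collect_every == 0:
            gc.collect()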