RAM usage scales with number of GPUs?

I’m trying to train a model on 8 GPUs at once (on the same machine), using DistributedDataParallel. My issue is that this ends up using a lot of memory, and I have to “cripple” my dataloader a bit in order to even make it fit within the machine’s 64 GB of RAM.

The reason seems to be a combination of two things. First, my dataloader upon being initialized loads the entire dataset (about 3GB total) into RAM, so that loading batches later will be very fast. Second, I can see that a separate python process is created for each GPU that I train on (even with num_workers=0 in the dataloader), and each of these processes uses a substantial amount of RAM, presumably because each process loads its own copy of the entire dataset into RAM. This scales linearly with the number of GPUs I use.
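
For concreteness, here’s a simplified sketch of the pattern my dataset class follows (the real loading code is more involved, and the class and file names here are just placeholders):

```python
import torch
from torch.utils.data import Dataset

class PreloadedDataset(Dataset):
    """Reads everything from disk once in __init__ and keeps it in RAM."""

    def __init__(self, data_path, label_path):
        # The whole ~3GB dataset ends up resident in this process's memory.
        self.data = torch.load(data_path)
        self.labels = torch.load(label_path)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Fetching a batch is fast because everything is already in RAM.
        return self.data[idx], self.labels[idx]
```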

I’m not sure the above fully explains the problem, as the data is only 3GB, yet each process uses about 6GB of RAM. Perhaps the data is stored in RAM in a bulkier (e.g. higher-precision) format than on disk? Or perhaps the remaining RAM is taken up by other things, like model weights or gradients (though I thought those should live in GPU memory)?

I guess my main question is: is this normal / inevitable behavior? If the problem is indeed that 8 copies of the dataset end up getting stored in RAM, is there some way to avoid this and load it into RAM only once? Or is this issue caused by something else?

Maybe this issue can answer your question.

Thanks for the pointer! I agree it looks relevant, but having read through it and tried some of the things suggested there, I think it might be a different issue. For me the problem isn’t caused by having num_workers>0 in the dataloader. If I train on just 1 GPU with num_workers>0, there is no problem; if I train on multiple GPUs with num_workers=0, the problem does occur.

Having read some other documentation, I wonder if this is actually the normal behavior for DistributedDataParallel. E.g. this article says about DDP (emphasis mine):

  1. Each worker maintains its own copy of the model weights and its own copy of the dataset.

However I’m not sure whether they mean this copy is maintained in VRAM on the corresponding GPU, or if it’s maintained in RAM through a separate subprocess for each GPU (the latter seems to be occurring in my case).

For DDP, each process manages one GPU by default and has its own dataloader, so each process does maintain its own copy of the model weights and of the dataset in RAM, as you observed. If you would like to have only one shared dataset across the different processes, you can try writing a custom data loader with shared memory; for example, this post shows how to do it: How to cache an entire dataset in multiprocessing?
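
In case it helps, here is a rough sketch (untested, and the tensor layout / file paths are just placeholders) of loading the data once in the main process, moving it into shared memory, and handing it to each DDP process:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.utils.data import Dataset, DataLoader, DistributedSampler

class SharedTensorDataset(Dataset):
    """Wraps tensors that already live in shared memory."""
    def __init__(self, data, labels):
        self.data, self.labels = data, labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

def worker(rank, world_size, data, labels):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    dataset = SharedTensorDataset(data, labels)
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler, num_workers=0)
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    world_size = 8
    data = torch.load("data.pt")      # placeholder paths
    labels = torch.load("labels.pt")
    data.share_memory_()              # move the underlying storage into shared memory
    labels.share_memory_()            # so child processes reuse it instead of copying it
    mp.spawn(worker, args=(world_size, data, labels), nprocs=world_size)
```

The key point is that tensors whose storage is in shared memory are passed to the spawned processes by handle rather than copied, so only one copy of the data sits in RAM.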


Thanks, that was helpful. Now I know that this is essentially expected behavior. The thread you linked is also very helpful, although so far I haven’t been able to get DistributedSampler to work correctly.
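
For reference, the pattern I’ve been trying with DistributedSampler (which I understand to be the standard one; `dataset` and `num_epochs` stand in for my actual objects) looks like this:

```python
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

# Assumes the process group has already been initialized in each DDP process.
sampler = DistributedSampler(dataset,
                             num_replicas=dist.get_world_size(),
                             rank=dist.get_rank(),
                             shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # otherwise every epoch uses the same shuffle order
    for batch in loader:
        ...  # training step
```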

Having looked into it more, I think I’m going to pursue a different solution for now, where I just don’t load the whole dataset into RAM, but instead load batches from the hard drive when necessary. I figured pre-loading all the data into RAM would provide a big speedup, but having thought about it more and based on advice from others, I now think this intuition is not correct, since using multiple workers allows upcoming batches to be pre-loaded into RAM anyway.
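
In case it’s useful to anyone else, here is a simplified sketch of the lazy-loading dataset I’m planning to switch to (the on-disk layout and field names are made up for illustration):

```python
import os
import torch
from torch.utils.data import Dataset, DataLoader

class LazyDiskDataset(Dataset):
    """Loads each sample from disk on demand instead of preloading everything."""

    def __init__(self, root):
        self.root = root
        self.files = sorted(os.listdir(root))  # one file per sample

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        sample = torch.load(os.path.join(self.root, self.files[idx]))
        return sample["x"], sample["y"]

# Multiple workers prefetch upcoming batches in the background,
# so the full dataset never needs to sit in RAM.
loader = DataLoader(LazyDiskDataset("data/"), batch_size=32,
                    num_workers=4, pin_memory=True)
```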