High CPU Memory Usage

When I run my experiments on the GPU, the process occupies a large amount of CPU memory (~2.3 GB). However, when I run the same experiments on the CPU, it occupies a very small amount of CPU memory (<500 MB). This memory overhead prevents me from training multiple models at the same time.

Can someone please help me debug which component is causing this memory overhead?

I have added sample code at GitHub - divyeshrajpura4114/asv-sample

This is likely the CUDA initialization. PyTorch ships with a relatively large number of kernels, and CUDA loads and registers them when its context is set up on startup, which costs host memory. This is particularly difficult on more constrained platforms like the Jetson.
I thought of patching PyTorch to load all kernels through nvrtc instead of linking them into PyTorch, but it is quite a bit of work and I was hoping someone else would fix things instead (with the recent split of libtorch_cuda, it seems people are digging in various directions there, even if we’re not there yet).
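
If you want to confirm that the overhead really comes from CUDA context creation rather than from your model or data, a minimal sketch like the following (assuming psutil is installed) compares the process's resident memory before and after the first CUDA call:

```python
# Minimal sketch: measure host RSS before and after the CUDA context
# is created, independent of any model or data.
import os

import psutil
import torch

proc = psutil.Process(os.getpid())
print(f"RSS before CUDA init: {proc.memory_info().rss / 1e6:.0f} MB")

# The first CUDA tensor allocation forces the CUDA context to be set up
# and the kernels to be loaded.
_ = torch.empty(1, device="cuda")
torch.cuda.synchronize()

print(f"RSS after CUDA init:  {proc.memory_info().rss / 1e6:.0f} MB")
```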

Best regards

Thomas

@tom Thanks for your response.

These concepts are a bit new to me, but what I understand is that CPU memory usage will be high whenever we train a model on the GPU.

I have one more question. When I increase num_workers to 4 in the DataLoader, each of the 4 worker processes takes a similarly high amount of CPU memory (~2.3 GB, as I mentioned above). Is this normal?
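
To check this, a small sketch like the following (using psutil; the TensorDataset here is just a placeholder for my real dataset) could print each worker's resident memory:

```python
# Sketch: report each DataLoader worker's resident memory (requires psutil).
import os

import psutil
import torch
from torch.utils.data import DataLoader, TensorDataset


def report_worker_memory(worker_id):
    # Runs once in every worker process right after it starts.
    rss = psutil.Process(os.getpid()).memory_info().rss / 1e6
    print(f"worker {worker_id} (pid {os.getpid()}): {rss:.0f} MB RSS")


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(1000, 10))  # placeholder dataset
    loader = DataLoader(dataset, batch_size=32, num_workers=4,
                        worker_init_fn=report_worker_memory)
    for _ in loader:
        pass
```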

@ptrblck @tom, I have multiple GPUs and also a large amount of CPU RAM, but when I start 2 trainings simultaneously, the response time degrades significantly because of memory consumption (maybe some processing also happens on the CPU to some extent). Is there any way to improve simultaneous trainings?

You mean other than buying more RAM? (This is only half-joking. I only have a single GPU, but I chose to max out my computer’s RAM capacity (which is 128GB) - compared to GPU prices, that seems only reasonable. For many commercial situations, “buy more RAM” might be the solution.)

More seriously: at least part of the memory usage is somewhat fundamental to how Python multiprocessing works (and doesn’t work), combined with limitations of how CUDA works.
One thing you can look into is splitting part of the processing in the dataset into an offline preprocessing step, moving other parts (e.g. augmentation) to the GPU, and then getting by with fewer processes for the DataLoader; see the sketch below.
For real-world applications, I have rarely seen a data pipeline that could not be drastically sped up with some tweaks.
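
As a rough illustration of that split (a sketch with assumed file names and a toy noise augmentation, not code from this thread): the Dataset only reads precomputed features from disk, and the augmentation runs on the GPU after the batch has been transferred, so a single worker is often enough:

```python
# Sketch of the split described above: cheap CPU work in the Dataset,
# augmentation on the GPU, fewer DataLoader workers.
import torch
from torch.utils.data import DataLoader, Dataset


class PrecomputedFeatures(Dataset):
    """Loads features that were preprocessed offline and saved with torch.save."""

    def __init__(self, feature_files):
        self.feature_files = feature_files

    def __len__(self):
        return len(self.feature_files)

    def __getitem__(self, idx):
        # Cheap CPU work only: read an already-preprocessed tensor from disk.
        return torch.load(self.feature_files[idx])


def gpu_augment(batch):
    # Example augmentation (additive noise) done on the GPU instead of
    # inside the DataLoader workers.
    return batch + 0.01 * torch.randn_like(batch)


feature_files = [f"features/utt_{i}.pt" for i in range(1000)]  # assumed layout
loader = DataLoader(PrecomputedFeatures(feature_files), batch_size=32,
                    num_workers=1, pin_memory=True)

for batch in loader:
    batch = batch.cuda(non_blocking=True)
    batch = gpu_augment(batch)
    # ... forward / backward pass ...
```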

Best regards

Thomas