Proper way to manage memory in multi-process scenario

Dear Community,

I wanted to ask about proper methods of GPU memory management in a distributed scenario, specifically: 1) how to pass gradients between processes, 2) how to track memory usage across different processes, and 3) the recommended way of releasing reserved memory when the underlying code makes extensive use of deep copies of the original tensors.

I am currently tasked with writing a Python library for simulating a distributed learning process, and I use PyTorch as the underlying framework for neural network training. I initialise a net, then deep copy it to n clients and perform a number of training rounds.
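For reference, the initialisation step looks roughly like this (a minimal sketch; `SimpleNet` and `n_clients` are placeholder names for illustration, not from any specific library):

```python
import copy
import torch.nn as nn

class SimpleNet(nn.Module):
    # placeholder architecture for illustration
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

net = SimpleNet()  # central model
n_clients = 4
# each client gets an independent copy: parameters are cloned,
# so client updates do not alias the central model's storage
clients = [copy.deepcopy(net) for _ in range(n_clients)]
```

Because `copy.deepcopy` clones every parameter tensor, each client holds its own storage, which is exactly why memory scales with the number of clients.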

The code runs a number of workers training simultaneously, each returning the model’s weights. For now, it relies heavily on deep copies of the gradients, as PyTorch does not otherwise allow passing gradients out of child processes (in short, I must perform a deep copy of the gradients before sending them back to the central orchestrator).
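To make the deep-copy step concrete, here is one common pattern (a sketch, not necessarily the post's exact code): detach each gradient, move it to the CPU, and clone it into a plain dict before sending it back, so the child process keeps no references to shared CUDA storage alive.

```python
import torch
import torch.nn as nn

def extract_gradients(model: nn.Module) -> dict:
    """Copy gradients into fresh CPU tensors that are safe to send
    to another process (no shared CUDA storage is kept alive)."""
    return {
        name: p.grad.detach().cpu().clone()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

# illustration on CPU: one backward pass, then extract the gradients
model = nn.Linear(4, 1)
loss = model(torch.randn(8, 4)).sum()
loss.backward()
grads = extract_gradients(model)  # keys: 'weight', 'bias'
```

The `.clone()` matters: without it, `.cpu()` on a tensor already on the CPU returns the same storage, and the orchestrator would keep the worker's tensors alive.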

However, I have observed what looks like a ‘memory leak’. Although it is not a real memory leak, my code does not seem to release reserved memory properly, and the reserved memory space grows over time. This can lead to the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 23.65 GiB total capacity; 10.33 GiB already allocated; 45.06 MiB free; 11.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

It seems that my code is not releasing shared CUDA tensors properly.

Therefore, I would like to ask for tips and recommendations for debugging this problem. Is there a way to inspect how much memory each device and process is currently reserving? Moreover, what would be the recommended way to initialise and pass tensors between child processes? I’ve inspected the documentation, and it seems that creating deep copies will increase memory usage - but it should not accumulate over time as it does now. Is there a way to ‘force-release’ memory from selected devices?
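For anyone with the same question, a minimal sketch of per-device inspection and cache release, using only documented `torch.cuda` calls (it is a no-op on machines without CUDA; note that `empty_cache()` only returns unused cached blocks to the driver and cannot free memory still held by live tensor references):

```python
import gc
import torch

def gpu_memory_report() -> dict:
    """Per-device allocated vs. reserved byte counts for the current
    process (empty dict when no CUDA device is available)."""
    report = {}
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            report[i] = {
                "allocated": torch.cuda.memory_allocated(i),
                "reserved": torch.cuda.memory_reserved(i),
            }
    return report

def release_cached_memory() -> None:
    """Drop dangling Python references first, then hand cached blocks
    back to the driver. Tensors still referenced stay allocated."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

print(gpu_memory_report())
release_cached_memory()
```

For a more detailed breakdown, `torch.cuda.memory_summary()` prints a per-allocator report, and a large gap between reserved and allocated usually points to fragmentation rather than a leak.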

I am very grateful for all your help - device usage and memory allocation are much trickier in a distributed scenario.