Reduce CPU memory usage while enabling non_blocking CPU offload

I have a training pipeline which offloads various components (model, model EMA, optimizer) to the CPU at various training step stages, and does so asynchronously (e.g. by calling buffer.to('cpu', non_blocking=True) for each data buffer).
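Roughly, the offload step looks like the following sketch (simplified; offload_to_cpu is just an illustrative helper name, not my actual code):

```python
import torch

# Simplified sketch of the offload step described above. Every state
# tensor is queued for an asynchronous device-to-host copy, so the GPU
# does not wait for the transfers to finish.
def offload_to_cpu(module: torch.nn.Module) -> dict:
    cpu_state = {}
    for name, buf in module.state_dict().items():
        cpu_state[name] = buf.to('cpu', non_blocking=True)
    return cpu_state
```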

Making these transfers non-blocking gives a significant speedup (almost 2x). However, it also blows up CPU RAM usage and my training process gets OOMKilled. My guess is that, due to the large number of in-flight transfers to/from the CPU, the process reserves more RAM than it really needs.

Is there any way to achieve a middle ground here? I could just make all the transfers blocking and the training runs fine, but it would be nice to somehow give PyTorch a hint about the maximum RAM usage in this async setting.

I don’t fully understand this statement, as I would assume your script needs the pinned memory to be able to move the tensors to it asynchronously. Why wouldn’t the process need this memory, in your opinion?

I just meant that, rather than getting OOMKilled, I would prefer the process to automatically pin less memory and potentially incur some speed cost by blocking on some of those transfers. Is there any way to achieve that?
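To illustrate the kind of middle ground I mean, something like the following manual throttle would do it (a rough sketch only; offload_with_throttle and max_in_flight are made-up names), where only a bounded number of copies are ever in flight:

```python
import torch

def offload_with_throttle(tensors, max_in_flight=8):
    """Offload tensors to the CPU asynchronously, but cap how many
    copies may be in flight at once by waiting on the oldest one."""
    events, cpu_copies = [], []
    for t in tensors:
        cpu_copies.append(t.to('cpu', non_blocking=True))
        ev = torch.cuda.Event()
        ev.record()                      # marks the point after this copy on the current stream
        events.append(ev)
        if len(events) >= max_in_flight:
            events.pop(0).synchronize()  # block until the oldest queued copy has finished
    for ev in events:                    # drain the remaining copies before using the results
        ev.synchronize()
    return cpu_copies
```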

PyTorch won’t pin memory behind your back; you would need to either enable it via the DataLoader or pin tensors explicitly. It thus also won’t be able to reduce its usage automatically, and you might need to reduce the number of workers or avoid pinning too many tensors manually.
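For reference, a minimal sketch of the two explicit options mentioned above, plus reusing one pinned staging buffer so the amount of pinned RAM stays bounded (the shapes and variable names are arbitrary):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 1) Pinning via the DataLoader: each batch is placed into pinned memory
#    by the loader, so it can later be sent to the GPU with non_blocking=True.
dataset = TensorDataset(torch.randn(1024, 16))
loader = DataLoader(dataset, batch_size=32, num_workers=2, pin_memory=True)

# 2) Pinning explicitly on a tensor, reusing a single pinned staging
#    buffer for repeated device-to-host copies (arbitrary shapes).
gpu_param = torch.randn(1024, 1024, device='cuda')
staging = torch.empty_like(gpu_param, device='cpu').pin_memory()

staging.copy_(gpu_param, non_blocking=True)  # async D2H copy into pinned memory
torch.cuda.synchronize()                     # wait before touching the data on the CPU
cpu_copy = staging.clone()                   # pageable copy; the pinned buffer is reused
```

Reusing one pre-pinned staging buffer like this keeps the pinned RAM constant regardless of how many offload steps run, at the cost of a synchronization and an extra CPU copy per transfer.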