CPU RAM saturated by tensor.cuda()

Hi,

I am noticing a ~3 GB increase in CPU RAM usage after the first .cuda() call. The increase is largely independent of the original CPU size of the object being moved, whether it is a single tensor or an nn.Module subclass.

I’ve noticed this behavior on my workstation:

OS: Ubuntu 20.04.4 LTS
Processor: Intel® Xeon(R) W-2223 CPU @ 3.60GHz × 8
GPU: 2*RTXA6000
PyTorch: 1.10.0+cu111

And also on the Nvidia Jetson Nano.

Here is a minimal example in which moving a one-element tensor to the GPU results in a ~3 GB increase in CPU RAM usage:

import os
import torch

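def get_ram():
    # The last line of `free -t -m` is "Total: <total> <used> <free>" (system-wide)
    _, cpu_used, _ = map(int, os.popen('free -t -m').readlines()[-1].split()[1:])
    # Memory currently held by tensors on GPU 0, converted from bytes to MB
    gpu_allocated = torch.cuda.memory_allocated(0) / 10**6
    return cpu_used, gpu_allocated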

# Initial GPU/CPU occupation
c0,g0  = get_ram()

# Occupation after defining a tensor on CPU
tensor = torch.tensor([0])
c1,g1  = get_ram()

# Occupation after moving the tensor to CUDA
tensor = tensor.cuda()
c2,g2  = get_ram()

print('Occupancy after CPU tensor definition:')
print()
print('CPU used: \t{}\tMb'.format(c1-c0))
print('GPU used: \t{}\tMb'.format(g1-g0))
print()
print('Occupancy after moving tensor to CUDA:')
print()
print('CPU used: \t{}\tMb'.format(c2-c0))
print('GPU used: \t{}\tMb'.format(g2-g0))

output:

Occupancy after CPU tensor definition:

CPU used: 1 Mb
GPU used: 0.0 Mb

Occupancy after moving tensor to CUDA:

CPU used: 3297 Mb
GPU used: 0.000512 Mb

I do not know what causes this behavior. Is it normal? Can it be mitigated?

It is particularly problematic for me because I am developing DNN applications on the Jetson Nano, and this RAM jump almost saturates the available memory by default.

My understanding is that this is the result of libtorch_cuda.so being loaded when the first CUDA function is called (in order to copy the tensor to the GPU), so it is expected behavior. I’m not sure there are any great workarounds at the moment, other than the highly involved route of a custom source build that keeps unnecessary functions from being loaded.

On the other hand, I doubt that all or even most of the loaded CUDA functions are “hot,” so this might be a case where increasing the swap size of your system could help as hopefully these functions are paged out in favor of hotter memory usage (e.g., actual tensors used in the computation).
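
As a side note, the `free`-based measurement in the original post is system-wide, so it also picks up whatever other processes are doing. A rough way to attribute the jump to the Python process itself is to read its resident set size from /proc. The sketch below is Linux-only; the helper names `rss_mb` and `rss_delta_mb` are made up for illustration, and the workload in the demo is a plain byte buffer standing in for the CUDA call.

```python
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")  # bytes per memory page


def rss_mb():
    """Resident set size of the current process in MB (Linux only)."""
    with open("/proc/self/statm") as f:
        # Second field of statm is the number of resident pages.
        resident_pages = int(f.read().split()[1])
    return resident_pages * PAGE_SIZE / 10**6


def rss_delta_mb(fn):
    """Run fn() and return how much the process RSS grew, in MB."""
    before = rss_mb()
    fn()
    return rss_mb() - before


if __name__ == "__main__":
    # Stand-in workload: allocate and fill ~50 MB so the pages become resident.
    keep = []
    delta = rss_delta_mb(lambda: keep.append(b"\x01" * (50 * 10**6)))
    print("RSS grew by about {:.0f} MB".format(delta))
```

With PyTorch installed, something like `rss_delta_mb(lambda: torch.zeros(1).cuda())` should give the growth caused by the first CUDA call in this process alone.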

Thanks, this clarifies the problem. I would have thought that all the necessary libraries were loaded at the first torch import, but this is not the case.

When I run inference on the Jetson, the 4 GB of RAM is full and the 4 GB swap partition is about half full. So the system can cope with loading the CUDA libraries by moving whatever is not needed to swap. The main effect of this behavior is that the “hot” component of the memory consumption is masked, and the RAM always appears full. It is not a critical problem, but it is useful to know.
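
For anyone who wants to see how much of a given process is actually resident versus pushed out to swap, the kernel exposes both numbers in /proc/<pid>/status. This is only a sketch under the assumption of a Linux system that reports the `VmRSS` and `VmSwap` fields; the function name `proc_memory_mib` is hypothetical.

```python
import os


def proc_memory_mib(pid="self"):
    """Return the VmRSS and VmSwap of a process in MiB, read from /proc."""
    wanted = ("VmRSS", "VmSwap")
    fields = {}
    with open("/proc/{}/status".format(pid)) as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in wanted:
                # Values look like "  123456 kB"; convert kB to MiB.
                fields[key] = int(rest.split()[0]) / 1024
    return fields


if __name__ == "__main__":
    mem = proc_memory_mib(os.getpid())
    print("resident: {:.1f} MiB".format(mem.get("VmRSS", 0.0)))
    print("swapped:  {:.1f} MiB".format(mem.get("VmSwap", 0.0)))
```

Running this against the PID of the inference process (instead of `os.getpid()`) would show how much of its memory is still in RAM and how much has been paged out, which may give a better picture than the overall RAM figure.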