Moving a tiny model to CUDA causes a 2 GB host memory allocation

Hi,
I have the following code:

import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.output_layer = nn.Linear(1, 1)

net = Net()
# normal memory usage
net = net.cuda()
# Python now consumes about 2 GB of host memory

# now waste time so we can look at the memory usage
b = 23
for i in range(1000000000):
    for j in range(1000000000):
        b = b + i % 3434 + j % 3553
        b = b % 2354

print(f"aha, {b}")

When running this on Linux (Kubuntu 19.10) with Python 3.7 and PyTorch 1.3, host memory usage is about 2 GB and GPU memory usage increases by about 700 MB.

I understand that there is some sort of memory pool when using CUDA (so 700 MB of GPU memory sounds OK), and I can imagine that it might be necessary to keep a copy of all GPU memory on the host. But the host memory is almost triple the GPU memory, which seems odd to me.
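For reference, a minimal way to watch both numbers from within Python (just a sketch; psutil and a working nvidia-smi on the PATH are assumptions, and nvidia-smi reports the whole GPU, not just this process) looks roughly like this:

import subprocess

import psutil
import torch

def host_mb():
    # Resident set size of this Python process, in MB
    return psutil.Process().memory_info().rss / 1024**2

def gpu_mb():
    # GPU memory in use according to the driver (includes the CUDA context)
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().split()[0])

print(f"before cuda(): host {host_mb():.0f} MB, gpu {gpu_mb()} MB")
net = torch.nn.Linear(1, 1).cuda()
print(f"after  cuda(): host {host_mb():.0f} MB, gpu {gpu_mb()} MB")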

Would be cool if somebody could explain that :slight_smile:

Hi,

This is most likely just the memory used by the CUDA driver :confused:

import torch
# No memory/process on the GPU + 180MB of CPU memory
input("Press enter to continue")

# Store a CPU Tensor
a = torch.rand(1)
# Still nothing on the GPU, still 180MB on the CPU
input("Press enter to continue")

# Store a Tensor with one element on the GPU (the Tensor takes 4 bytes)
a = torch.rand(1, device="cuda")
# The CUDA driver is now initialized and uses 543MB on my GPU and almost 2GB - 180MB on the CPU
input("Press enter to continue")
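To convince yourself that essentially none of that is tensor data, the allocator's own counters help (a small sketch; on newer PyTorch versions memory_cached() is called memory_reserved()):

import torch

a = torch.rand(1, device="cuda")

# Bytes handed out for tensors by PyTorch's caching allocator:
# a few hundred bytes (allocations are rounded up), nowhere near 543MB
print("allocated:", torch.cuda.memory_allocated(), "bytes")

# Bytes the allocator has reserved from the driver for its pool: still only
# on the order of a couple of MB. The rest of what nvidia-smi shows is the
# CUDA context itself.
print("cached:   ", torch.cuda.memory_cached(), "bytes")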

It sounds very excessive to me, but I guess I’ll just have to accept that.

From what I recall, there is a fixed overhead for the CUDA runtime, plus an overhead that depends on the amount of CUDA code in the whole codebase (not sure why). And we do have a large number of CUDA kernels, since we ship PyTorch’s own ones plus cuDNN, MAGMA, etc.
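If you want to see that fixed part on its own, timing the lazy initialization gives a rough idea (just a sketch; the numbers vary a lot between machines):

import time

import torch

start = time.time()
torch.cuda.init()                    # triggers the lazy CUDA context creation
torch.cuda.synchronize()
print(f"CUDA init: {time.time() - start:.1f}s")

start = time.time()
b = torch.rand(1, device="cuda")     # the actual work is cheap by comparison
torch.cuda.synchronize()
print(f"first tiny allocation: {time.time() - start:.3f}s")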


OK, well, I guess it’s like loading a standard C/C++ library: the whole code is loaded into memory, which is a lot in the case of PyTorch (as you said).

And in the case of CUDA it might take even more space than standard C/C++, as it needs to load the device-independent version and possibly a device-specific cached version.

Still, to me it seems like a big inefficiency (code could be loaded on demand, etc.). For me (research) that isn’t an issue, but it can make running inference on embedded devices impossible.
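For anyone curious where that host memory actually sits, a rough, Linux-only approach is to sum the resident size of CUDA-related library mappings in /proc/self/smaps (just a sketch; the library name patterns are guesses for my setup, and this only counts file-backed mappings, so anonymous allocations made by the driver are not included):

import re

import torch

torch.rand(1, device="cuda")  # force CUDA initialization first

# Adjust the patterns to whatever CUDA libraries your build actually links
cuda_libs = re.compile(r"libcuda|libcudart|libcudnn|libcublas|libtorch_cuda", re.I)
header = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s")  # start of a new mapping

total_rss_kb = 0
in_cuda_mapping = False
with open("/proc/self/smaps") as f:
    for line in f:
        if header.match(line):
            # New mapping: remember whether it belongs to a CUDA-related library
            in_cuda_mapping = bool(cuda_libs.search(line))
        elif line.startswith("Rss:") and in_cuda_mapping:
            total_rss_kb += int(line.split()[1])  # reported in kB

print(f"resident memory in CUDA-related mappings: {total_rss_kb / 1024:.0f} MB")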

Cheers for the answer :slight_smile:

Edit:
I also found this:
https://devtalk.nvidia.com/default/topic/1044191/determine-memory-cuda-context-memory-usage/?offset=5
They have a similar problem, although on a different scale (a 30MB vs. 180MB allocation when just instantiating a kernel on different GPUs). An NVIDIA engineer answered that the allocation depends on the GPU (number of SMs), so I guess in my case (RTX 2080) it’s pretty large.
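For reference, the SM count is easy to query from PyTorch (tiny sketch; device index 0 is assumed):

import torch

props = torch.cuda.get_device_properties(0)
# Per the NVIDIA answer, the per-context allocation scales with the SM count
print(props.name, "has", props.multi_processor_count, "SMs")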

Update (I hope it helps somebody):
I started to write my own C++/CUDA extension, along with a small C++ starter that I want to use for easier debugging and profiling with cuda-gdb and friends. After making everything work with the CMake build system, I started up Nsight to step into the code. Right after calling Tensor::cuda(), the debugger hung. I thought something was broken, but it turns out that this one step simply takes one minute and 40 seconds; after that, stepping is back to normal. I guess loading all those kernels inside cuda-gdb is super slow.

Similar things were also reported on the NVIDIA forums (though that was back in 2009): https://forums.developer.nvidia.com/t/cuda-gdb-performance/9175