import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.output_layer = nn.Linear(1, 1)

net = Net()
# normal memory usage
net = net.cuda()
# python consumes about 2gb of host memory

# now waste time so we can look at the memory usage
b = 23
for i in range(1000000000):
    for j in range(1000000000):
        b = b + i % 3434 + j % 3553
        b = b % 2354
print(f"aha, {b}")
When running this on Linux (Kubuntu 19.10) with Python 3.7 and PyTorch 1.3, host memory usage is about 2 GB and GPU memory usage increases by about 700 MB.
I understand that CUDA uses some sort of memory pool, so 700 MB of GPU memory sounds plausible. I could also imagine that a copy of all GPU memory needs to be kept on the host. But host memory usage is almost triple the GPU usage, which seems odd to me.
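To put numbers on this, one way (a sketch assuming Linux, where `ru_maxrss` is reported in kilobytes) is to sample the process's peak resident set size from the standard library before and after the first CUDA call:

```python
import resource

def peak_rss_mb():
    # ru_maxrss is in kilobytes on Linux (it is in bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print(f"before: {peak_rss_mb():.0f} MB")
# import torch
# net = Net().cuda()  # the first CUDA call initializes the context
print(f"after:  {peak_rss_mb():.0f} MB")
```

With the torch lines uncommented, the jump between the two samples isolates the cost of loading PyTorch and initializing CUDA from your model's own memory.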
This is most likely just the memory used by the CUDA driver:
import torch

# No memory/process on the GPU + 180MB of CPU memory
input("Press enter to continue")

# Store a CPU Tensor
a = torch.rand(1)

# Still nothing on the GPU, still 180MB on the CPU
input("Press enter to continue")

# Store a Tensor with one element on the GPU (the Tensor itself takes 4 bytes)
a = torch.rand(1, device="cuda")

# The CUDA driver is now initialized and uses 543MB on my GPU and almost 2GB - 180MB on the CPU
input("Press enter to continue")
From what I recall, there is a fixed overhead for the CUDA runtime, plus an overhead that depends on the amount of CUDA code in the whole codebase (I'm not sure why). And we do have a large number of CUDA kernels, since we ship PyTorch's own ones, cuDNN, MAGMA, etc.
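On Linux you can get a feel for where that goes by checking which shared objects are mapped into the process and how much address space they occupy. A rough sketch that parses `/proc/self/maps` (the name patterns are just examples; note that mapped address space is an upper bound, not resident memory):

```python
def mapped_mb(pattern):
    """Sum the sizes of all mappings whose path contains `pattern`, in MB."""
    total = 0
    with open("/proc/self/maps") as f:
        for line in f:
            parts = line.split()
            # fields: address perms offset dev inode [pathname]
            if len(parts) >= 6 and pattern in parts[5]:
                start, end = parts[0].split("-")
                total += int(end, 16) - int(start, 16)
    return total / (1024 * 1024)

# After `import torch` and a first .cuda() call, patterns such as
# "libcudart", "libcudnn" or "torch" show where the address space goes:
print(f"libc mappings: {mapped_mb('libc'):.1f} MB")
```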
OK, I guess it's like loading a standard C/C++ library: the whole code is loaded into memory, which is a lot in the case of PyTorch (as you said).
And in the case of CUDA it might take even more space than standard C/C++ code, since both the device-independent version and possibly a device-specific cached version need to be loaded.
Still, this seems like a big inefficiency to me (code could be loaded on demand, etc.). For my use case (research) that isn't an issue, but it can make running inference on embedded devices impossible.
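For what it's worth, newer CUDA toolkits (11.7 and later, so not the PyTorch 1.3 setup above) do support loading kernels on demand via the `CUDA_MODULE_LOADING` environment variable. Whether it helps depends on how PyTorch was built; a sketch of how you would opt in:

```python
import os

# Must be set before the CUDA context is created, i.e. before importing torch.
# Assumes CUDA 11.7+; earlier toolkits simply ignore this variable.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

# import torch
# a = torch.rand(1, device="cuda")  # kernels are then loaded on first use
print(os.environ["CUDA_MODULE_LOADING"])
```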
Update (I hope it might help somebody):
I started to write my own C++/CUDA extension, along with a small C++ starter. I want to use that starter for easier debugging and profiling with cuda-gdb and friends. After making everything work with the CMake build system, I started up Nsight to step into the code. Right after calling Tensor::cuda(), the debugger hung. I thought something was broken, but it turns out that this one step just takes one minute and 40 seconds. After that, stepping is back to normal. I guess loading all these kernels inside cuda-gdb is very slow.