From the discussion, this explanation makes sense to me: memory leaks are often created by storing some training information, such as the loss, without detaching it from the computation graph, which stores the whole graph along with it.
However, I don’t understand why the memory usage remains the same when I simply accumulate all the outputs into a variable (with no detach operation). I think I am still storing the training information, right?
If you append a tensor with an attached computation graph (i.e., a valid .grad_fn), the computation graph will be stored with it and you should see increased memory usage in each iteration.
Is this the case for you or are you seeing any other issue?
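A minimal CPU-only sketch of the difference (assuming a toy nn.Linear stands in for your real model):

```python
import torch

net = torch.nn.Linear(3, 3)  # toy stand-in for the real model
outputs_attached = []
outputs_detached = []
for _ in range(3):
    x = torch.randn(4, 3)
    y = net(x)
    outputs_attached.append(y)           # keeps .grad_fn, so the whole graph is retained
    outputs_detached.append(y.detach())  # breaks the graph, stores only the values

print(outputs_attached[0].grad_fn is not None)  # True: graph is still attached
print(outputs_detached[0].grad_fn is None)      # True: graph was released
```

Every entry in the first list keeps its backward graph alive, so the retained memory grows with each appended output; the detached copies only hold the values.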
Thank you for the reply! Let me post some code here so that I can explain it more clearly.
import torch

net = net.cuda()  # net is assumed to be defined earlier
sum = 0  # note: shadows the built-in sum()
while True:
    batch = 4
    h = 3
    w = 3
    num_outputs = 5
    x = torch.randn(batch, 1, h, w).cuda()
    y = net(x)
    sum += y
So basically, I am just summing all the outputs, and the computation graph should be attached to the outputs, right? However, the GPU usage remains the same. I don’t understand why.
PyTorch uses a custom caching memory allocator, which will try to reuse the device memory.
Thus nvidia-smi shows the overall memory usage, including the CUDA context, the allocated memory, and the cached memory (also from other processes), so for a small model you might only see the increased memory usage after a couple of iterations.
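To see the actual allocations instead of relying on nvidia-smi, you can query the allocator directly. A hedged sketch (the tiny model here is an assumption, and the loop is guarded so it only runs when a GPU is present):

```python
import torch

# torch.cuda.memory_allocated() reports the memory occupied by live tensors,
# while torch.cuda.memory_reserved() includes the allocator's cache --
# the latter is closer to what nvidia-smi shows (minus the CUDA context).
if torch.cuda.is_available():
    net = torch.nn.Linear(3, 3).cuda()
    total = torch.zeros(4, 3, device="cuda")
    for step in range(5):
        x = torch.randn(4, 3, device="cuda")
        total = total + net(x)  # graph attached to total grows each iteration
        print(step,
              torch.cuda.memory_allocated(),
              torch.cuda.memory_reserved())
```

If the accumulated tensor keeps its graph, memory_allocated should creep up each step even while nvidia-smi appears flat, because new allocations are served from the already-reserved cache.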