Scope and memory consumption of tensors created using self.new_* API

I am struggling to fit my model into a 16GB GPU because of a CUDA out-of-memory error. What is even more intriguing is that the model runs fine for roughly the first 2000 steps; the memory reported by nvidia-smi then gradually climbs from 14GB to 16GB until the program finally crashes. I declare a lot of tensors in the forward function using new_tensor or new_zeros, and I suspect they are not being dereferenced or freed, which would explain the accumulation from 14GB to 16GB. Here is some dummy code:

import torch
import torch.nn as nn

class Test(nn.Module):
    def __init__(self):
        super(Test, self).__init__()
        self.weights = nn.Parameter(torch.zeros(5, 5))

    def forward(self, x):
        # Temporary tensor created with the same device/dtype as x
        dummy_constant = x.new_ones(self.weights.shape[0], x.shape[1])
        output = self.weights @ x
        output += dummy_constant
        return output

model = Test()
for i in range(1, 100):
    x = torch.rand(5, i)
    out = model(x)
    # loss.backward() and other stuff

So, all in all, will every instance of dummy_constant stay in memory even after it goes out of scope?


It should be deleted when output goes out of scope.
Are you storing output or your loss somehow?
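
As a quick sanity check (just a sketch, assuming a CUDA device and the Test module from above), you could watch torch.cuda.memory_allocated() across iterations; if nothing holds onto the outputs, the number should stay flat:

import torch

model = Test().cuda()
for i in range(1, 100):
    x = torch.rand(5, i, device='cuda')
    out = model(x)
    out.sum().backward()  # dummy loss, so a graph is actually built and then freed
    # dummy_constant and this step's graph are freed once no reference to out survives
    print(torch.cuda.memory_allocated() / 1024**2, "MB allocated")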

Thanks for the quick reply. No, it is not being stored; in fact, I have also tried calling del loss, logits, inputs explicitly. Is there any recommended way of looking at the memory usage/allocation on the GPU by tensors?

You could use this code snippet to see all currently allocated tensors.
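
Something along these lines (a sketch of the idea, not necessarily the exact linked code):

import gc
import torch

for obj in gc.get_objects():
    try:
        # Catch plain tensors as well as objects wrapping a tensor in .data
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except Exception:
        pass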


I get 3.97 GB as output from the code snippet (I have modified it slightly, please check), and if I use del on the loss, outputs, and other tensors used/created in the training step, it drops to 2.35 GB. The main problem is that nvidia-smi shows the full 16GB being consumed.
Here is the modified code snippet:

import gc
import torch
from functools import reduce

total = 0
for obj in gc.get_objects():
  try:
    if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
      if len(obj.size()) > 0:
        if obj.type() == 'torch.cuda.FloatTensor':
          total += reduce(lambda x, y: x*y, obj.size()) * 32   # float32 -> 32 bits per element
        elif obj.type() == 'torch.cuda.LongTensor':
          total += reduce(lambda x, y: x*y, obj.size()) * 64   # int64 -> 64 bits per element
        elif obj.type() == 'torch.cuda.IntTensor':
          total += reduce(lambda x, y: x*y, obj.size()) * 32   # int32 -> 32 bits per element
        #else:
          # Few non-CUDA tensors in my case, coming from the dataloader
  except Exception as e:
    pass
print("{} GB".format(total / ((1024**3) * 8)))  # total is in bits, hence the extra division by 8

Some more updates on the issue:
If I disable the backprop part of the model, the snippet reports 0.78GB, but in that case the program crashes very quickly when I don't use del logits, since the existing memory isn't reused (maybe loss.backward() sets some flag that allows unused parts of the GPU memory to be reclaimed).


I wonder if gc.get_objects() actually returns all the objects that were created (especially the ones on the GPU); maybe some C-level API is creating objects whose references dangle somewhere out of scope, so gc cannot reach them. I am not sure about this, but let me know your thoughts :slightly_smiling_face:

nvidia-smi might show a higher usage, since PyTorch uses a caching memory allocator.
Have a look at the memory management docs for more information.
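
For example (an illustrative sketch, using the current API names):

import torch

x = torch.randn(1024, 1024, device='cuda')
del x
# Memory actually handed out to live tensors
print(torch.cuda.memory_allocated() / 1024**2, "MB allocated")
# Memory the caching allocator has reserved from the driver;
# this is closer to what nvidia-smi reports
print(torch.cuda.memory_reserved() / 1024**2, "MB reserved")
# Returns cached, unused blocks to the driver (usually not necessary)
torch.cuda.empty_cache()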

So your current problem is that your GPU runs out of memory if you don't delete logits?
Could you post your code, please? It might be due to unwanted storing of logits, thus keeping the computation graph alive.
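
To illustrate what I mean by unwanted storing (a hypothetical pattern; model, criterion, and loader are placeholders for your own code):

stored_logits = []
for x, y in loader:
    logits = model(x)
    loss = criterion(logits, y)
    loss.backward()
    stored_logits.append(logits)            # keeps each step's graph alive -> memory grows
    # stored_logits.append(logits.detach())  # detaching drops the graph and avoids this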

After some more debugging, it turned out that my model had been close to going OOM from the beginning, and it only actually crashed when an input with a large batch size arrived. I was able to delay the crash by doing del logits at the end, which apparently left more room for the larger batches, but it nevertheless crashed after many epochs. The reason why the model does not fit on the GPU seems a bit bizarre, and I have opened a separate discussion thread about that.
Many thanks for the support :slightly_smiling_face: