Memory leak from unowned inference

I have a minimal example that increases GPU memory usage by ~1GB each time it is run.


Running this line multiple times in a Jupyter notebook increases the GPU memory usage reported by nvidia-smi by ~1GB per run. This happens even if I call gc.collect() and torch.cuda.empty_cache() afterwards. This surprises me: since no variable references the result of the inference, I would expect it to be garbage collected. Any suggestions for how to debug this further?

simpleGPT2 is my implementation of a Transformer:

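import torch as t

# GPTBlock is defined elsewhere and is not shown in this post.
class SimpleGPT2(t.nn.Module):
    def __init__(self, n_blocks=1, vocab_size=50257, context_length=1024, hidden_size=768, p_dropout=0.1):
        super().__init__()  # must run before any submodule is assigned
        self.wte = t.nn.Embedding(vocab_size, hidden_size)      # token embeddings
        self.wpe = t.nn.Embedding(context_length, hidden_size)  # position embeddings
        # Position indices 0..context_length-1 with shape [1, context_length]; not trainable.
        self.pe_matrix = t.nn.Parameter(t.arange(0, context_length).unsqueeze(0), requires_grad=False)
        self.dropout = t.nn.Dropout(p_dropout)
        self.gpt_blocks = t.nn.ModuleList([GPTBlock() for _ in range(n_blocks)])
        self.layernorm = t.nn.LayerNorm(hidden_size)
        # A t.nn.Linear(hidden_size, vocab_size) layer is also created here; its attribute name is missing from the post.

        for layer in [self.wte, self.wpe]:
            ...  # loop body is missing from the post

    def forward(self, input_ids: t.Tensor, attention_mask: t.Tensor = None):
        x = input_ids
        n, seq_len = x.shape
        # expand(n, -1) keeps all context_length positions, so seq_len must equal context_length.
        hidden = self.wte(x) + self.wpe(self.pe_matrix.expand(n, -1))
        hidden = self.dropout(hidden)
        for gpt_block in self.gpt_blocks:
            hidden = gpt_block(hidden)
        hidden = self.layernorm(hidden)
        # The rest of forward is cut off in the post.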

and encoded_input is a dictionary storing two Tensors of shape [1, 1024], i.e., the input and attention mask.

Check whether you are storing any tensors that are still attached to the computation graph, such as the model output or the loss, in a container like a list. Doing so keeps alive not only the tensor itself but also the entire computation graph, which increases memory usage. Detach the tensor before storing it to fix this.
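
As a minimal sketch of the difference, assuming a small stand-in model (the `torch.nn.Linear` layer and the list names below are hypothetical and not taken from the question), the following runs on CPU and shows that a stored tensor keeps its `grad_fn`, and with it the graph, unless it is detached first:

```python
import torch

# Hypothetical stand-in model; the same applies to any nn.Module with trainable parameters.
model = torch.nn.Linear(8, 4)
inputs = torch.randn(2, 8)

kept_with_graph = []  # tensors stored as-is
kept_detached = []    # tensors detached before storing

for _ in range(3):
    out = model(inputs)
    loss = out.sum()
    # Storing the tensor directly also keeps a reference to its autograd graph.
    kept_with_graph.append(loss)
    # detach() returns a tensor with no autograd history, so the graph can be freed.
    kept_detached.append(loss.detach())

has_graph = [x.grad_fn is not None for x in kept_with_graph]
detached_has_graph = [x.grad_fn is not None for x in kept_detached]
print(has_graph)           # [True, True, True]
print(detached_has_graph)  # [False, False, False]
```

For a scalar such as a loss, `loss.item()` is another option, since it returns a plain Python number rather than a tensor.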