I have a minimal example that increases GPU memory usage by ~1 GB each time it is run:
simpleGPT2(**encoded_input)
Running this line multiple times in a Jupyter notebook increases the GPU memory usage reported by nvidia-smi by roughly 1 GB per call. This happens even if I call gc.collect() and torch.cuda.empty_cache() afterwards. That surprises me: since no variable references the result of the inference, I would expect it to be garbage collected. Any suggestions for how to debug this further?
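For reference, this is roughly the cell I keep re-running, together with the cleanup calls mentioned above; the memory printouts (torch.cuda.memory_allocated / memory_reserved) are just there to give numbers I can compare against nvidia-smi:

import gc
import torch as t

# the call whose result is never assigned to a variable
simpleGPT2(**encoded_input)

# cleanup I already tried between runs
gc.collect()
t.cuda.empty_cache()

# allocator stats to compare with nvidia-smi
print(f"allocated: {t.cuda.memory_allocated() / 2**30:.2f} GiB")
print(f"reserved:  {t.cuda.memory_reserved() / 2**30:.2f} GiB")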
simpleGPT2 is an instance of SimpleGPT2, my implementation of a Transformer:
import torch as t

# GPTBlock and init_layer are defined elsewhere (omitted for brevity)
class SimpleGPT2(t.nn.Module):
    def __init__(self, n_blocks=1, vocab_size=50257, context_length=1024, hidden_size=768, p_dropout=0.1):
        super().__init__()
        self.wte = t.nn.Embedding(vocab_size, hidden_size)  # token embeddings
        self.wpe = t.nn.Embedding(context_length, hidden_size)  # positional embeddings
        # position indices [0, ..., context_length - 1], stored as a non-trainable parameter
        self.pe_matrix = t.nn.Parameter(t.arange(0, context_length).unsqueeze(0), requires_grad=False)
        self.dropout = t.nn.Dropout(p_dropout)
        self.gpt_blocks = t.nn.ModuleList([GPTBlock() for _ in range(n_blocks)])
        self.layernorm = t.nn.LayerNorm(hidden_size)
        self.final = t.nn.Linear(hidden_size, vocab_size)
        for layer in [self.wte, self.wpe, self.final]:
            init_layer(layer)

    def forward(self, input_ids: t.Tensor, attention_mask: t.Tensor):
        # attention_mask is accepted so that **encoded_input unpacks cleanly, but is not used yet
        x = input_ids
        n, seq_len = x.shape
        hidden = self.wte(x) + self.wpe(self.pe_matrix[:, :seq_len].expand(n, -1))
        hidden = self.dropout(hidden)
        for gpt_block in self.gpt_blocks:
            hidden = gpt_block(hidden)
        hidden = self.layernorm(hidden)
        return self.final(hidden)
encoded_input is a dictionary holding two tensors of shape [1, 1024]: the input ids and the attention mask.
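In case the setup matters, the model and encoded_input are created roughly like this; the random token ids below are just a stand-in for the real input, but the keys, shapes, and device match what I actually use:

import torch as t

device = "cuda"
simpleGPT2 = SimpleGPT2().to(device)

encoded_input = {
    "input_ids": t.randint(0, 50257, (1, 1024), device=device),  # vocab_size = 50257
    "attention_mask": t.ones(1, 1024, dtype=t.long, device=device),
}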