CPU RAM leak when using a Hugging Face BERT model for inference

I am attempting to use my fine-tuned DistilBERT model to extract the embedding of the ‘[CLS]’ token. For every row in my dataset I want to extract this feature and collect the results into an array.

However, my code seems to be suffering from a memory leak. My dataset has roughly 14K rows, and by the time the code finishes executing, Google Colab has either crashed or reports that I have used almost all of its 25 GB of RAM!

Each embedding is a tensor with 768 elements, so for 14K rows the returned array should only take about 14,000 × 768 × 4 bytes ≈ 43 MB (assuming float32 embeddings), nowhere near 25 GB.
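A quick back-of-the-envelope check (assuming float32, i.e. 4 bytes per element):

# rough size estimate for the final array of embeddings
n_rows, hidden_size, bytes_per_elem = 14_000, 768, 4  # assumes float32
print(f"{n_rows * hidden_size * bytes_per_elem / 1e6:.0f} MB")  # ~43 MB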

Here is the function that appears to be leaking memory:

import torch

def getPooledOutputs(model, encoded_dataset, batch_size=128):
  pooled_outputs = []
  n_rows = len(encoded_dataset['input_ids'])
  n_iters = (n_rows + batch_size - 1) // batch_size  # ceiling division over batches
  print("total number of iters", n_iters)

  for i in range(n_iters):
    start = i * batch_size
    up_to = min(start + batch_size, n_rows)  # clamp the final (partial) batch

    input_ids = torch.LongTensor(encoded_dataset['input_ids'][start:up_to])
    attention_mask = torch.LongTensor(encoded_dataset['attention_mask'][start:up_to])

    with torch.no_grad():
      # last hidden state, position 0 -> the [CLS] token embedding
      embeddings = model(input_ids=input_ids,
                         attention_mask=attention_mask,
                         output_hidden_states=True)['hidden_states'][-1][:, 0]
      pooled_outputs.extend(embeddings)

  return pooled_outputs

Can you tell if memory usage is growing with each iteration? (e.g., if you forcibly reduce the number of iterations, does the memory usage go down?)
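For example, you could log the process's resident set size (RSS) on each iteration and see whether it climbs steadily. A minimal sketch using psutil, which is typically preinstalled on Colab; the print statement would go inside the batch loop:

import psutil

process = psutil.Process()  # the current Python process

# inside the loop, after pooled_outputs.extend(embeddings):
print(f"iter {i}: RSS = {process.memory_info().rss / 1e6:.0f} MB")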

I’ve tried using a fourth of my dataset and did not see a concerning amount of RAM usage. It is only towards the very end of the loop over the full dataset that my RAM usage skyrockets. This is based solely on the RAM indicator that the Google Colab environment displays.

This image can give you an idea:
[screenshot: Colab RAM usage indicator]