GPU memory usage increases with a large dataset


I am writing NLP code myself to convert natural-language text to code. When I use a smaller dataset, say a 500 KB file, my code runs very fast and I can use a larger batch size without any issue. But when I use a larger dataset, like a 50 MB file, the GPU memory usage on Google Colab suddenly grows far too much. It even exceeds 40 GB, the maximum available, unless I keep the batch size at 1, use gradient accumulation, and use the SGD optimizer instead of Adam. Training then takes a long time, around 8 hours.

My question is: why does the dataset size affect GPU memory? It shouldn't, since I only move the source and target batches to the GPU during training, and the model size should be fixed as well.

At the same time, shouldn't the model predict different outputs each time for the smaller dataset too? Why is it predicting the same thing? I am probably asking two unrelated questions here.

This is how I am creating the batches of the dataset:


### generate a batch of data and add padding, for faster training.

def generate_batch(data_batch):
  src_batch = []
  trg_batch = []
  for (nl_item, code_item) in data_batch:
    # prepend BOS, append EOS, and concatenate into a single 1-D tensor
    src_batch.append(torch.cat([torch.tensor([BOS_IDX]), nl_item, torch.tensor([EOS_IDX])], dim=0))
    trg_batch.append(torch.cat([torch.tensor([BOS_IDX_CODE]), code_item, torch.tensor([EOS_IDX_CODE])], dim=0))

  src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
  trg_batch = pad_sequence(trg_batch, padding_value=PAD_IDX_CODE)

  return src_batch, trg_batch

### PyTorch DataLoader setup.

train_iter = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=generate_batch)

val_iter = DataLoader(val_data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=generate_batch)

Could you print the input and target shapes used inside the DataLoader training loop for the "small" and "large" datasets?
I would guess these shapes increase significantly, which would also increase the GPU memory usage.
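As a minimal sketch of what I mean, here is how you could print the padded batch shapes coming out of a collate function like yours. The dummy dataset and the `collate` helper here are stand-ins for your real `(nl, code)` pairs and `generate_batch`; note how a single long example stretches the whole padded batch:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Dummy (source, target) token-tensor pairs with different lengths,
# standing in for the real (nl_item, code_item) pairs.
data = [(torch.randint(0, 100, (n,)), torch.randint(0, 100, (n + 5,)))
        for n in (8, 12, 30)]

def collate(batch):
    # Pad every sequence in the batch to the length of the longest one.
    src = pad_sequence([s for s, _ in batch], padding_value=0)
    trg = pad_sequence([t for _, t in batch], padding_value=0)
    return src, trg

loader = DataLoader(data, batch_size=3, collate_fn=collate)
for src, trg in loader:
    # pad_sequence defaults to (max_len, batch_size)
    print("src shape:", src.shape, "trg shape:", trg.shape)
```

Even though two sequences have length 8 and 12, the batch is padded to length 30 because of the single long example, so memory scales with the longest sequence per batch, not the average.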


I will do that. The big files do contain larger examples. But is there any way to determine the cause of the memory usage, i.e. whether it's a memory leak or valid memory consumption?

I don't know what "large examples" means in this context, but I assume it's related to another dimension such as the sequence length. If that dimension grows, a memory increase would also be expected, assuming you are not cutting the samples to a predefined shape.
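Cutting samples to a predefined shape could look like the sketch below. `MAX_LEN` and the `truncate` helper are hypothetical names; in practice you would pick the cap from your data's length distribution:

```python
import torch

# Assumed cap; in practice choose this from your data, e.g. the 95th
# percentile of sequence lengths.
MAX_LEN = 16

def truncate(seq, max_len=MAX_LEN):
    # Cut a 1-D token tensor down to at most max_len tokens so that a
    # handful of very long examples cannot blow up the padded batch.
    return seq[:max_len]

print(truncate(torch.arange(100)).shape)  # torch.Size([16])
print(truncate(torch.arange(5)).shape)    # torch.Size([5])
```

You would apply this to `nl_item` and `code_item` inside the collate function, before the BOS/EOS tokens are added.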

To determine whether a larger input increases the memory usage, I would suggest executing the training step only once with the small and the large sample and comparing the memory usage. A "memory leak" usually refers to an unexpected increase in memory usage, caused by e.g. storing tensors together with their computation graph in a list, which would not be visible in a single step.
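A minimal sketch of such a single-step comparison, using `torch.cuda.reset_peak_memory_stats` and `torch.cuda.max_memory_allocated`. The helper name `one_step_peak_memory` is made up, and it assumes the model takes a single `src` argument; your seq2seq model's forward signature will differ:

```python
import torch
import torch.nn as nn

def one_step_peak_memory(model, src, trg, loss_fn, optimizer, device):
    # Run exactly one training step and return the peak GPU memory (in
    # bytes) that this step allocated. The CUDA counters do not apply on
    # CPU, so 0 is returned there.
    use_cuda = device.type == "cuda"
    if use_cuda:
        torch.cuda.reset_peak_memory_stats(device)
    model.to(device)
    src, trg = src.to(device), trg.to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(src), trg)  # assumes model(src) -> prediction
    loss.backward()
    optimizer.step()
    return torch.cuda.max_memory_allocated(device) if use_cuda else 0

# Toy usage with a stand-in regression model; with your real model you
# would call this once with a small batch and once with a large batch
# and compare the two numbers.
model = nn.Linear(8, 1)
src = torch.randn(4, 8)
trg = torch.randn(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
peak = one_step_peak_memory(model, src, trg, nn.MSELoss(), opt, device)
print("peak bytes for one step:", peak)
```

If the large batch's peak is proportionally higher, that is valid consumption from the larger shapes; if memory only grows across many iterations, that points to a leak instead.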
