I am writing NLP code myself to convert natural-language text to code. With a smaller dataset, say a 500 KB file, my code runs very fast and I can use a larger batch size without any issue. But with a larger dataset, such as a 50 MB file, GPU memory usage on Google Colab suddenly grows far too large. It even exceeds 40 GB, the maximum available, unless I set the batch size to 1, use gradient accumulation, and use the SGD optimizer instead of Adam. Even then, training takes a long time, around 8 hours.
My question is: why does a larger dataset affect GPU memory? It shouldn't, since I move only the source and target batches to the GPU during training, and the model size should be fixed as well.
At the same time, for the smaller dataset, shouldn't the model predict a different output each time? Why is it predicting the same thing? I am probably asking two unrelated questions here.
This is how I create the batches from the dataset.

    import torch
    from torch.nn.utils.rnn import pad_sequence
    from torch.utils.data import DataLoader

    # Generate a batch of data and add padding. Batch format for faster training.
    def generate_batch(data_batch):
        src_batch, trg_batch = [], []
        for (nl_item, code_item) in data_batch:
            # print("src batch type", type(src_batch))
            src_batch.append(torch.cat([torch.tensor([BOS_IDX]), nl_item, torch.tensor([EOS_IDX])], dim=0))
            trg_batch.append(torch.cat([torch.tensor([BOS_IDX_CODE]), code_item, torch.tensor([EOS_IDX_CODE])], dim=0))
        # Pad every sequence to the length of the longest one in the batch.
        src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
        trg_batch = pad_sequence(trg_batch, padding_value=PAD_IDX_CODE)
        return src_batch, trg_batch

    # PyTorch DataLoader setup.
    train_iter = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=generate_batch)
    val_iter = DataLoader(val_data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=generate_batch)
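
Since the batching code above pads every sequence to the longest one in the batch, one thing that might be worth checking is how large the padded batches actually get. The sketch below is only a rough diagnostic, not part of my training code: `padded_batch_stats` is a hypothetical helper, and the sequence lengths are made-up values chosen to show the effect of a single long example. It assumes (without my having confirmed it) that the 50 MB dataset contains some much longer examples than the 500 KB one.

```python
def padded_batch_stats(lengths):
    """Describe one padded batch given its token counts per sequence.

    Returns (max_len, batch_size, total_cells, padding_fraction), where
    total_cells is the number of elements in the padded tensor.
    """
    max_len = max(lengths)            # every sequence is padded to this length
    batch_size = len(lengths)
    total_cells = max_len * batch_size
    real_tokens = sum(lengths)
    padding_fraction = 1 - real_tokens / total_cells
    return max_len, batch_size, total_cells, padding_fraction


# Hypothetical lengths: a batch of short sequences...
short_batch = padded_batch_stats([12, 15, 9, 14])
# ...and the same batch with one very long outlier.
outlier_batch = padded_batch_stats([12, 15, 9, 4000])

print("short batch:  ", short_batch)
print("outlier batch:", outlier_batch)
```

With these made-up lengths, the padded tensor grows from 60 cells to 16000 cells at the same batch size, and almost all of the second batch is padding. If the real data behaves like this, logging `src_batch.shape` and `trg_batch.shape` inside the training loop should show it.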