Implementing batch gradient descent by accumulating loss over multiple samples


My training samples have different sizes, so I can't stack them into a single tensor and run batch gradient descent (BGD) the traditional way. Instead, I wrote a custom collate_fn() that stores all the samples of a batch in a list, and during training I iterate over the list and run the forward pass on each sample as usual. To implement BGD this way, I sum the per-sample losses and backprop once at the end of the batch. However, I found that GPU memory usage is the same as with batch size = 1. My understanding is that all the input samples of a batch should stay in GPU memory so that their computation graphs are retained until the backward pass. Does this indicate that my implementation is not working? If so, what is the correct way to do this? Any help would be highly appreciated.

My training code is as follows:

for data_batch, label_batch in loader:
    self.optimizer.zero_grad()
    loss = 0
    for data, label in zip(data_batch, label_batch):
        # the .to() calls were truncated in the original post;
        # a device attribute such as self.device is assumed here
        data = data.float().to(self.device)
        label = label.long().to(self.device)

        output = self.model(data)
        loss += self.loss(output, label)

    # backprop once on the summed loss at the end of the batch
    loss.backward()
    self.optimizer.step()
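For comparison, a common variant of this pattern that keeps memory flat is to call backward() on each per-sample loss immediately (scaled by the batch size), so each sample's graph is freed right away while gradients keep accumulating in .grad. The sketch below is a minimal, self-contained illustration with made-up names: the nn.Linear model, the mean-pooling over time, and the sample sizes are all assumptions for the example, not the original code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model: maps a variable-length (T, 4) sequence to 3 class
# logits by mean-pooling over the time dimension.
model = nn.Linear(4, 3)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Variable-length samples, as a list-based collate_fn would produce them.
data_batch = [torch.randn(t, 4) for t in (2, 5, 3)]
label_batch = [torch.tensor(0), torch.tensor(2), torch.tensor(1)]

optimizer.zero_grad()
for data, label in zip(data_batch, label_batch):
    output = model(data).mean(dim=0, keepdim=True)        # (1, 3) logits
    loss = loss_fn(output, label.unsqueeze(0)) / len(data_batch)
    # Backprop per sample: this frees the sample's graph immediately,
    # while gradients accumulate in each parameter's .grad field.
    # Mathematically this is equivalent to summing the (scaled) losses
    # and calling backward() once at the end of the batch.
    loss.backward()
optimizer.step()
```

With this scheme, only one sample's activations are alive at a time, which would also explain memory usage similar to batch size = 1.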