Mini-batch within train loop

Long story short, I cannot modify the input batch size of 128 of my data loader. When I do:

for batch_idx, o_t in enumerate(train_loader):
    o_t = o_t.cuda()
    y = model(o_t)

I get a CUDA out of memory error.

To get around this, I tried the following:

for batch_idx, o_t in enumerate(train_loader):
    mini_batch_size = 16
    y = []
    for mini_batch_idx in range(int(128/mini_batch_size)):
        start, end = mini_batch_idx*mini_batch_size, (mini_batch_idx+1)*mini_batch_size
        o_t_mini = o_t[start:end]
        o_t_mini = o_t_mini.cuda()

        y_mini = model(o_t_mini)

        o_t_mini, y_mini = o_t_mini.cpu(), y_mini.cpu()
        y.append(y_mini)

    y = torch.cat(y, dim=0)

However, this does not help either. I observe that GPU memory usage increases linearly after each forward pass, not when the tensors are moved to the device, and moving the tensors back to the CPU has no effect on GPU memory either.
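The behavior described above can be reproduced with a minimal sketch (the model, shapes, and batch count here are placeholders, not the original code): calling `.cpu()` on the output copies the tensor to the CPU, but the autograd graph it carries still references the GPU intermediates, so allocated memory keeps growing.

```python
import torch
import torch.nn as nn

# Placeholder model and data standing in for the thread's `model`/`train_loader`.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 512).to(device)

o_t = torch.randn(128, 512)  # one full batch of 128
y = []
for i in range(8):  # 8 mini-batches of 16
    o_t_mini = o_t[i * 16:(i + 1) * 16].to(device)
    y_mini = model(o_t_mini)
    # .cpu() copies the data, but the stored computation graph
    # still keeps the GPU activations alive.
    y.append(y_mini.cpu())
    if device == "cuda":
        print(i, torch.cuda.memory_allocated())
```

On a CUDA device the printed allocation grows with each iteration, matching the linear increase described above.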

Any ideas why this is the case, and how I could get around this? Thanks!

Is this a very large model? You may want to try a simple toy module such as nn.Conv2d and check whether it still crashes. Most of the GPU memory may be taken by the model, not the data.
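A minimal sketch of that sanity check (the channel counts and input shape are made up for illustration): swap in a tiny nn.Conv2d, run one forward pass with the same batch size, and see how much memory is actually allocated.

```python
import torch
import torch.nn as nn

# Tiny stand-in module for the sanity check; shapes are assumptions.
device = "cuda" if torch.cuda.is_available() else "cpu"
toy = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)

x = torch.randn(128, 3, 32, 32, device=device)  # full batch of 128
y = toy(x)

if device == "cuda":
    # Memory held by tensors (model + activations), in MiB.
    print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated")
```

If this runs comfortably at batch size 128 while the real model OOMs at 16, the memory is going to the model and its activations rather than to the input data.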

I see. But I don’t understand why it would increase with every forward pass – shouldn’t it remain constant once the model parameters have been initialized?

Unfortunately, I don’t remember why, but I also found in the past that the memory would increase for the first couple of epochs before stabilizing.

The increase in memory is expected, since you are storing each y_mini output in y along with its complete computation graph.
If you want to reduce the memory usage, you could e.g. calculate the loss and gradients using y_mini inside the inner loop.
This would free the intermediate tensors (needed to calculate the gradients), which are currently all kept alive on the GPU.
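That suggestion amounts to gradient accumulation; a minimal sketch, with a made-up model, loss, and target tensor standing in for the originals:

```python
import torch
import torch.nn as nn

# Hypothetical small model and data; only the accumulation pattern matters.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(32, 10).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

o_t = torch.randn(128, 32)       # one full batch from the loader
target = torch.randn(128, 10)    # matching targets (assumed)
mini_batch_size = 16
n_mini = 128 // mini_batch_size

optimizer.zero_grad()
for i in range(n_mini):
    start, end = i * mini_batch_size, (i + 1) * mini_batch_size
    o_t_mini = o_t[start:end].to(device)
    y_mini = model(o_t_mini)
    # Scale so the accumulated gradients match the full-batch average.
    loss = criterion(y_mini, target[start:end].to(device)) / n_mini
    # backward() on each mini-batch frees its intermediate activations,
    # so peak memory stays at one mini-batch instead of the full batch.
    loss.backward()
optimizer.step()
```

Since each call to `loss.backward()` releases that mini-batch's graph, nothing accumulates on the GPU across the inner loop, and the optimizer step uses the same averaged gradient a single batch of 128 would have produced.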