Larger GPU memory usage during the first batch

Hi, I have training code that runs on a single GPU like this:

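import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn

def train(train_loader, model, optimizer, criterion):
    for batch_idx, (inputs, targets) in enumerate(train_loader):
        # Reset the peak counter so each print reports this iteration's peak
        torch.cuda.reset_peak_memory_stats()
        inputs = inputs.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)

        outputs = model(inputs)
        loss = criterion(outputs, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print(f'Max memory allocated: {torch.cuda.max_memory_allocated() // 1024 // 1024}')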

def main():
    model = Net().cuda()
    cudnn.benchmark = True

    train_dataset = MyDataset()
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True, pin_memory=True)

    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    print(f'Max memory before training: {torch.cuda.max_memory_allocated() // 1024 // 1024}')
    train(train_loader, model, optimizer, criterion)

The output I get (in MB) is:

Max memory before training: 487
Max memory allocated: 15402
Max memory allocated: 13615
Max memory allocated: 13591
Max memory allocated: 13591
Max memory allocated: 13591
Max memory allocated: 13591

Why is the GPU memory usage of the first batch larger than that of the following ones? This would lead to an OOM error if I used a slightly larger model. Is there any way to reduce this extra memory usage?

cudnn benchmarking will try different algorithms for your input shape and model.
Some algorithms trade memory for speed, so you might see a higher memory usage and a slowdown (due to multiple profiling runs) for the first iteration.

This shouldn’t lead to an OOM error, as cudnn shouldn’t pick an algorithm that uses too much memory.
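
To see this effect in isolation, one option is to time each iteration and record its peak memory separately. The following is only a rough sketch: `profile_iterations` and the small conv model are hypothetical stand-ins (not the `Net` from the question), and on a machine without CUDA the memory column is simply `None`.

```python
import time
import torch
import torch.nn as nn

def profile_iterations(model, inputs, num_iters=3):
    """Return a list of (seconds, peak_MB) for each forward/backward pass."""
    use_cuda = inputs.is_cuda
    stats = []
    for _ in range(num_iters):
        if use_cuda:
            torch.cuda.reset_peak_memory_stats()  # measure this iteration only
            torch.cuda.synchronize()
        start = time.perf_counter()
        model(inputs).sum().backward()
        if use_cuda:
            torch.cuda.synchronize()  # wait for the GPU before stopping the clock
        elapsed = time.perf_counter() - start
        peak_mb = torch.cuda.max_memory_allocated() // 1024 // 1024 if use_cuda else None
        stats.append((elapsed, peak_mb))
        model.zero_grad()
    return stats

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.backends.cudnn.benchmark = True  # set to False to compare

# Hypothetical small conv model, used only for illustration
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1),
).to(device)
inputs = torch.randn(8, 3, 64, 64, device=device)

stats = profile_iterations(model, inputs)
for i, (seconds, peak_mb) in enumerate(stats):
    print(f"iter {i}: {seconds * 1000:.1f} ms, peak {peak_mb} MB")
```

With benchmarking enabled on a GPU, the first iteration would be expected to show the longest time and possibly the highest peak; the exact numbers depend on the model, input shape, and hardware.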

Thanks for your reply! Memory usage drops, especially for the first batch, once cudnn benchmarking is commented out, but the running time becomes 30% longer.

From the output I posted above, I assume that the memory used by the benchmarked algorithms can exceed 1.5 GB. For example, if 15 GB out of 16 GB is occupied by the training itself, will this cause an OOM error? I wonder whether there is a way to make cudnn benchmarking try these algorithms before training starts, to reduce the peak memory allocated.
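
For reference, the first-batch overhead can be worked out directly from the posted log. This small sketch uses illustrative variable names and assumes the steady-state value (13591) is the footprint of the training itself:

```python
# Peak memory per iteration, in MB, copied from the output posted above.
peaks_mb = [15402, 13615, 13591, 13591, 13591, 13591]

steady_mb = min(peaks_mb)            # assumed steady-state training footprint
extra_mb = peaks_mb[0] - steady_mb   # extra peak seen only in the first batch

print(f"First-batch overhead: {extra_mb} MB ({extra_mb / 1024:.2f} GiB)")
# prints: First-batch overhead: 1811 MB (1.77 GiB)
```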

The slowdown without benchmarking is expected, since cudnn benchmarking tries to find the fastest algorithm for your workload.

An OOM shouldn’t happen in that case, as cudnn should discard all algorithms that could cause an OOM error. Have you seen an OOM while using cudnn.benchmark?

Benchmarking will run for the first iteration and cache the fastest algorithms for the current setup.
If you don’t want to trigger it in the first training iteration, you could pass a dummy tensor with the same shape as your training samples before entering the training loop.
Note that each new input shape will trigger the benchmarking again, so if you are dealing with a lot of different shapes, your code might run faster if you disable it.
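
The dummy-tensor suggestion could look roughly like the sketch below. `warm_up`, the stand-in model, and the shapes are all hypothetical; the real input shape and model would have to match the training setup. Note that this moves the benchmarking pass (and its temporary memory) ahead of the training loop rather than guaranteeing a lower peak, and that a dummy pass in training mode would also update things like BatchNorm running statistics.

```python
import torch
import torch.nn as nn

def warm_up(model, criterion, input_shape, num_classes, device):
    """Run one throwaway forward/backward pass so cudnn benchmarking happens here."""
    dummy_inputs = torch.randn(*input_shape, device=device)
    dummy_targets = torch.randint(0, num_classes, (input_shape[0],), device=device)
    loss = criterion(model(dummy_inputs), dummy_targets)
    loss.backward()  # backward convolutions are benchmarked as well
    for param in model.parameters():
        param.grad = None  # discard gradients from the dummy batch
    if device.type == "cuda":
        torch.cuda.synchronize()
        torch.cuda.empty_cache()              # release cached, unused blocks
        torch.cuda.reset_peak_memory_stats()  # start training with a fresh peak counter

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.backends.cudnn.benchmark = True

# Hypothetical stand-in for the real model; shapes are assumptions
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
).to(device)
criterion = nn.CrossEntropyLoss()

# Same batch size and sample shape as the (assumed) training batches
warm_up(model, criterion, input_shape=(32, 3, 64, 64), num_classes=10, device=device)
```

If the last batch of an epoch is smaller than the others, it has a different shape and would trigger benchmarking again; passing `drop_last=True` to the `DataLoader` is one way to avoid that.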

Actually, my real training code is a more complicated version of the one I posted. It doesn’t just use a cross-entropy loss; it runs in a distributed manner in which every process computes a hand-crafted loss and gradients. Several tensors are used as intermediate variables. Will cudnn take these into account when choosing algorithms?