A memory footprint gap between the first loop and the following loops

I have evaluation code like this:

"""
Load models and model parameters to the GPU
"""
for items in my_dataloader:
    """
    Load items (input data) to the GPU
    """
    torch.cuda.reset_peak_memory_stats()
    loss = loss_fcn()       # model forward
    print(f'Device {torch.cuda.current_device()}: ', torch.cuda.max_memory_allocated() / 1024**2, ' MB.')

The output in the terminal is:

Device 0: 9294.86962890625 MB.
Device 0: 1775.291015625 MB.
Device 0: 1774.791015625 MB.
Device 0: 1775.2998046875 MB.

The question is: why is there such a huge memory footprint gap between the first loop and the following loops?
I also tested this during training. The output is:

Device 0: 8377.93115234375 MB.
Device 0: 6737.173828125 MB.
Device 0: 6686.787109375 MB.
Device 0: 6740.197265625 MB.

There is also a gap between the first loop and the following loops. So what’s the reason? Thanks in advance.

Are you running with torch.backends.cudnn.benchmark=True? It's possible that some of the kernels benchmarked during the first iteration use significant memory for workspaces but are not chosen for later iterations.
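If you want to double-check, something along these lines should work (a rough sketch only, reusing the my_dataloader / loss_fcn names from your snippet): confirm the flag, run one warm-up pass, and only then reset the peak-memory counter so the one-time benchmarking overhead is excluded from the measurement.

import torch

print(torch.backends.cudnn.benchmark)       # True -> cuDNN autotuning is enabled

# Warm-up pass: the first forward triggers cuDNN's algorithm benchmarking
items = next(iter(my_dataloader))           # same dataloader as in the snippet above
# ... move items to the GPU as before ...
loss = loss_fcn()                           # model forward
torch.cuda.synchronize()

# From here on, peak-memory readings reflect steady-state usage
torch.cuda.reset_peak_memory_stats()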

Yes, you are right! I did run my program with torch.backends.cudnn.benchmark=True.

After I set it to False, the output in the terminal is:

Device 0: 2773.99951171875 MB.
Device 0: 2781.44580078125 MB.
Device 0: 2782.21533203125 MB.
Device 0: 2780.44189453125 MB.

During training, the output is:

Device 0: 7446.09619140625 MB.
Device 0: 7954.59521484375 MB.
Device 0: 7948.68896484375 MB.
Device 0: 7951.12646484375 MB.

It looks like cuDNN is spending more memory in the first loop in order to save space and time in the following loops. Thanks very much!

Sure, and to clarify the last point: cuDNN is not exactly trying to save memory. In the first iteration it will try many strategies for the convolutions, and some of these strategies use more temporary memory (called workspaces by cuDNN). If those strategies are not the fastest ones (benchmarking only cares about speed, not space), they are not used in later iterations, so the temporary memory usage goes down.
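If it helps to see this in isolation, here is a minimal standalone sketch (a toy convolution, not your model) where the first iteration typically reports a much higher peak because of the benchmarking workspaces:

import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True              # enable cuDNN autotuning

conv = nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(8, 64, 128, 128, device='cuda')

with torch.no_grad():
    for i in range(4):
        torch.cuda.reset_peak_memory_stats()
        _ = conv(x)                                # iteration 0 benchmarks several algorithms
        torch.cuda.synchronize()
        peak = torch.cuda.max_memory_allocated() / 1024**2
        print(f'iter {i}: {peak:.1f} MB')          # iter 0 usually shows the largest peak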
