"""
Load models and model parameters to the GPU
"""
for items in my_dataloader:
"""
Load items(input data) to GPU
"""
torch.cuda.reset_peak_memory_stats()
loss = loss_fcn() // models forward
print(f'Device {torch.cuda.current_device()}: ', torch.cuda.max_memory_allocated() / 1024**2, ' MB.')
The question is why there is a huge memory footprint gap between the first loop and the following loops?
And I also test this during training. The output is:
Are you running with torch.backends.cudnn.benchmark=True? It could be possible that some kernels benchmarked during the first iteration are using significant memory for workspaces but not chosen to be used for later iterations.
Sure, and to clarify on the last point, cuDNN is not exactly trying to save memory, but in the first iteration it will try many strategies for convolution, and some of these strategies will use more temporary memory (called workspaces by cuDNN). If these strategies are not the fastest ones (benchmarking only cares about speed, not space), they will not be used on the next iteration so the temporary memory usage will go down.