Dataloaders and CUDA memory management

If I’m not mistaken, cudaMalloc rounds allocations up to 2MB blocks on newer GPUs. That is exactly 2097152 bytes (2097152 / 1024**2 = 2), which would explain the minimum cache size you are seeing.
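To make the arithmetic concrete, here is a small sketch of what rounding up to 2MB blocks would look like. The 2MB granularity is the assumption from above, and `round_up_to_block` is a hypothetical helper for illustration, not a CUDA or PyTorch function.

```python
# Assumed block size: 2 MB = 2 * 1024**2 = 2097152 bytes.
BLOCK_BYTES = 2 * 1024**2


def round_up_to_block(nbytes, block=BLOCK_BYTES):
    """Round a request of `nbytes` up to the next multiple of `block` bytes."""
    return ((nbytes + block - 1) // block) * block


# Even a tiny request would end up reserving a full block under this assumption.
for request in (1, 512, BLOCK_BYTES, BLOCK_BYTES + 1):
    reserved = round_up_to_block(request)
    print(f"{request:>8} B requested -> {reserved / 1024**2:.1f} MB reserved")
```

On a real GPU you could compare `torch.cuda.memory_allocated()` with `torch.cuda.memory_reserved()` after creating a small tensor to see how much the cache holds beyond what the tensor itself needs.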

You might have a data loading bottleneck.
Profile your data loading overhead using the code from the ImageNet example. If loading the data takes a noticeable share of each iteration, check this post from @rwightman, where he explains some possible workarounds.
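The timing pattern in the ImageNet example boils down to measuring how long each iteration waits for the next batch versus how long the whole iteration takes. Below is a minimal sketch of that idea, not the example's actual code: `profile_loader`, `step_fn`, and `max_batches` are names I made up, and `loader` can be any iterable, such as a `DataLoader`.

```python
import time


def profile_loader(loader, step_fn, max_batches=50):
    """Return (avg seconds waiting for data, avg seconds per full iteration)."""
    data_time = 0.0
    batch_time = 0.0
    n = 0
    end = time.perf_counter()
    for batch in loader:
        # Time spent waiting for the loader to hand over this batch.
        data_time += time.perf_counter() - end
        step_fn(batch)  # forward/backward/optimizer step would go here
        # Total time for this iteration: data wait plus the step itself.
        batch_time += time.perf_counter() - end
        n += 1
        if n >= max_batches:
            break
        end = time.perf_counter()
    if n == 0:
        return 0.0, 0.0
    return data_time / n, batch_time / n
```

If the first number is close to the second, the loop is mostly waiting on data. When the step runs on the GPU, call `torch.cuda.synchronize()` at the end of `step_fn`, since CUDA kernels launch asynchronously and the step would otherwise look faster than it is.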