If I’m not mistaken, cudaMalloc rounds allocations up to 2MB blocks on newer GPUs. That is exactly 2097152 bytes (2097152 / 1024**2 = 2), which would explain the minimal cache size.
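To make the arithmetic concrete, here is a small sketch of what rounding up to 2MB blocks would look like. It assumes the 2MB granularity mentioned above; `round_up_to_block` is a hypothetical helper for illustration, not part of CUDA or PyTorch, and the real allocator may behave differently.

```python
# Assumed block size: 2MB, as suggested above (not verified against the driver).
BLOCK_BYTES = 2 * 1024**2  # 2097152 bytes


def round_up_to_block(nbytes, block=BLOCK_BYTES):
    """Round a requested size up to the next multiple of `block` (hypothetical helper)."""
    return -(-nbytes // block) * block  # ceiling division, then scale back up


# Even a 1-byte request would occupy a full block under this assumption.
print(round_up_to_block(1) / 1024**2)            # size in MB
print(round_up_to_block(BLOCK_BYTES + 1) / 1024**2)
```

Under this assumption, any non-empty allocation shows up as at least 2MB in the cache.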
You might have a data loading bottleneck.
Profile your data loading overhead using the code from the ImageNet example. If you see some overhead in loading the data, check this post from @rwightman, where he explains some possible workarounds.
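If it helps, this is a rough sketch of the kind of timing the ImageNet example does: measure how long each iteration waits for the next batch versus the total loop time. `profile_loader` and `train_step` are placeholder names I'm making up here, and it works on any iterable, so you would pass your own DataLoader and training step.

```python
import time


def profile_loader(loader, train_step, max_batches=50):
    """Return (data_seconds, total_seconds) over at most `max_batches` batches."""
    data_time = 0.0
    start = end = time.perf_counter()
    for i, batch in enumerate(loader):
        # Time spent waiting for the loader to hand over this batch.
        data_time += time.perf_counter() - end
        # Note: on GPU, kernels run asynchronously; call torch.cuda.synchronize()
        # inside train_step if you want the compute time to be accurate.
        train_step(batch)
        end = time.perf_counter()
        if i + 1 >= max_batches:
            break
    return data_time, end - start


if __name__ == "__main__":
    # Toy stand-ins: a loader that takes 10 ms per batch and a no-op step.
    def toy_loader():
        for i in range(5):
            time.sleep(0.01)
            yield i

    data_s, total_s = profile_loader(toy_loader(), lambda batch: None)
    print(f"data loading: {data_s:.3f}s of {total_s:.3f}s total")
```

If the data time is a large fraction of the total, the loader is likely the bottleneck.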