I just upgraded my HW from a Titan to an A6000 and noticed something unexpected. On the Titan, nvidia-smi would report GPU memory usage around 21-23 GB (can't remember the exact number, but the Titan has 24 GB so it had to be less than that). Now the same code is using 38 GB on the A6000.
My dataloader does not know the HW I am using, so its parameters are fixed.
The only other thing I changed was upgrading the Dockerfile from pytorch/pytorch:1.6.0-cuda10.1-cudnn7-runtime to pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime, since the A6000 was not happy with the original.
Is this PyTorch just leaving stuff in memory since there is room, or is there a default data structure whose size changed in PyTorch?
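For what it's worth, one thing I've found helpful when debugging this kind of discrepancy: PyTorch's caching allocator keeps freed blocks "reserved" rather than returning them to the driver, so nvidia-smi reports more than live tensors actually occupy. A minimal sketch for comparing the two numbers (the tensor size here is just an arbitrary example):

```python
import torch

# The caching allocator keeps freed blocks reserved, so nvidia-smi
# typically shows more memory than live tensors actually occupy.
if torch.cuda.is_available():
    x = torch.empty(1024, 1024, device="cuda")  # ~4 MB of float32, arbitrary size
    del x
    # "allocated" drops back after del, but "reserved" (closer to what
    # nvidia-smi reflects) stays up until the cache is released.
    print(torch.cuda.memory_allocated())  # bytes held by live tensors
    print(torch.cuda.memory_reserved())   # bytes cached by the allocator
    torch.cuda.empty_cache()              # return cached blocks to the driver
```

If memory_allocated is roughly the same on both cards but memory_reserved is much larger on the A6000, that would point at the allocator caching more because there is room, rather than anything in your code growing.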