Model Memory Footprint Increases on New Machine

I’m trying to debug a situation where the same model/training code uses about 3x as much GPU memory on a new instance.

Instance 1:
P2, CUDA 10.2 (driver 440.33.01), PyTorch 1.5

Instance 2:
P2 (SageMaker), CUDA 11.0 (driver 450.51.05), PyTorch 1.6

The model I’m running is an LSTM. On instance 1, training with a batch size of 512 uses 2.3 GB of GPU memory. On instance 2, training with the same batch size of 512 uses 9.6 GB.

Both instances are running the same code (same commit) and the same model.

Does anyone know how I might go about debugging this?

How are you measuring the memory usage? Via nvidia-smi, or via torch.cuda.memory_allocated() and the other PyTorch memory methods?
Also, are you setting torch.backends.cudnn.benchmark=True, and which GPUs do those two machines have?
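To make the comparison meaningful, it helps to log memory from inside PyTorch on both instances, since nvidia-smi also counts the CUDA context and the caching allocator's reserved-but-unused pool, both of which differ between CUDA/PyTorch versions. A minimal sketch (the report_gpu_memory helper and its tag argument are illustrative, not a PyTorch API):

```python
import torch

def report_gpu_memory(tag=""):
    """Print GPU memory as seen by PyTorch's caching allocator.

    Note: nvidia-smi reports the full reserved pool plus CUDA
    context overhead, so it always reads higher than memory_allocated().
    """
    if not torch.cuda.is_available():
        print(f"{tag} CUDA not available")
        return
    allocated = torch.cuda.memory_allocated() / 1024**2   # live tensors
    reserved = torch.cuda.memory_reserved() / 1024**2     # allocator pool
    peak = torch.cuda.max_memory_allocated() / 1024**2    # high-water mark
    print(f"{tag} allocated={allocated:.1f} MiB "
          f"reserved={reserved:.1f} MiB peak={peak:.1f} MiB")

# Also log the cuDNN flags asked about above, since benchmark mode
# can pick different (more memory-hungry) algorithms per version.
print("cudnn.benchmark =", torch.backends.cudnn.benchmark)
report_gpu_memory("startup:")
```

Calling report_gpu_memory() right after the forward pass and again after backward on both instances would show whether the extra 7 GB is live tensor memory or allocator/context overhead.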