Forward (eval mode) one sample vs. many samples: GPU memory

Hi,
is there any explanation for this?

Case 1: forward a minibatch; GPU memory required: 3 GB

import torch

model = create_model().cuda()  # create_model() is my model factory (not shown)
model.eval()
x = torch.rand(32, 3, 224, 224, device='cuda')

with torch.no_grad():
    model(x)  # executing just this instruction takes 3 GB

Case 2: forward one sample; GPU memory required: 11 GB

model = create_model().cuda()  # same setup as above
model.eval()
x = torch.rand(32, 3, 224, 224, device='cuda')

with torch.no_grad():
    model(x[0].unsqueeze(0))  # executing only this instruction takes 11 GB

What could possibly cause this?

torch version: 1.10.0
The forward pass is executed inside 'torch.cuda.amp.autocast(enabled=True)'.
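
For reference, a minimal sketch of how the peak memory of both cases could be measured side by side (torchvision's resnet50 only stands in for create_model, which isn't shown here):

import torch
import torchvision

# resnet50 is a stand-in for the actual model built by create_model()
model = torchvision.models.resnet50().cuda()
model.eval()
x = torch.rand(32, 3, 224, 224, device='cuda')

# forward the full batch first, then a single sample, and report peak memory
for inp in (x, x[0].unsqueeze(0)):
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad(), torch.cuda.amp.autocast(enabled=True):
        model(inp)
    torch.cuda.synchronize()
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f'input shape {tuple(inp.shape)}: peak {peak_gb:.2f} GB')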

Thanks

Depending on your setup, and in particular if you are using torch.backends.cudnn.benchmark = True, different kernels with different memory requirements might be picked for each new input shape. Could you post a minimal, executable code snippet that shows the difference in memory usage, as well as the device you are using?
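
One quick way to test this hypothesis would be to disable cuDNN autotuning and re-run the single-sample forward; a sketch, reusing create_model from the question:

import torch

# force heuristic kernel selection instead of autotuning; if the 11 GB peak
# disappears, benchmark mode was picking a memory-hungry kernel for this shape
torch.backends.cudnn.benchmark = False

model = create_model().cuda()  # create_model as defined in the question
model.eval()
x = torch.rand(1, 3, 224, 224, device='cuda')

torch.cuda.reset_peak_memory_stats()
with torch.no_grad(), torch.cuda.amp.autocast(enabled=True):
    model(x)
print(f'peak: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB')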