Slightly quicker forward pass when looping through the batch in the forward function

Hello

I'm seeing something odd in PyTorch: the forward pass is slightly faster if I loop over the elements of the batch and pass each one through the model individually than if I pass the whole batch at once. My batch has shape [8, 18, 500, 500], where 8 is the batch size, and it runs on a T4 GPU. Is there any reason for this behavior?

Thanks

Could you post a minimal, executable code snippet showing this behavior, please?
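
Something along these lines would be enough. This is only a sketch under assumptions: the small conv model is a hypothetical stand-in for your network, and the helper names (`time_forward`, `compare_forward`) are made up for illustration. The important part is the synchronization: CUDA kernels are launched asynchronously, so without `torch.cuda.synchronize()` around the timed region the host-side timer can stop before the GPU has finished its work.

```python
import time

import torch
import torch.nn as nn


def sync(device):
    # CUDA kernels run asynchronously; wait for them before reading the clock.
    if device.type == "cuda":
        torch.cuda.synchronize(device)


def time_forward(fn, device, warmup=3, iters=10):
    # Average wall-clock seconds per call of fn, after a few warm-up calls.
    with torch.no_grad():
        for _ in range(warmup):
            fn()
        sync(device)
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        sync(device)
    return (time.perf_counter() - start) / iters


def compare_forward(shape=(8, 18, 500, 500), warmup=3, iters=10):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Hypothetical stand-in model; replace it with the real one.
    model = nn.Sequential(
        nn.Conv2d(shape[1], 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(16, 1, kernel_size=3, padding=1),
    ).to(device).eval()
    x = torch.randn(*shape, device=device)

    def batched():
        # One forward pass over the whole batch.
        return model(x)

    def looped():
        # One forward pass per sample, keeping the batch dimension of size 1.
        outs = [model(x[i:i + 1]) for i in range(x.shape[0])]
        return torch.cat(outs, dim=0)

    t_batched = time_forward(batched, device, warmup, iters)
    t_looped = time_forward(looped, device, warmup, iters)
    return t_batched, t_looped


if __name__ == "__main__":
    t_batched, t_looped = compare_forward()
    print(f"batched: {t_batched * 1e3:.2f} ms per forward")
    print(f"looped:  {t_looped * 1e3:.2f} ms per forward")
```

If the gap between the two numbers disappears once the synchronization and warm-up iterations are in place, the original measurement was most likely a timing artifact rather than a real difference.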