Unreasonable GPU memory consumption when truncating padding tokens

I am training a PyTorch model with a contrastive loss (pytorch-metric-learning), and there are two ways I can batch text examples. In the first method, I pad only up to the maximum sequence length in the batch. In the second, I always pad up to 510 tokens, no matter how long the sequences are. So the tensor size in the first method is always smaller than or equal to that in the second. Yet what I observe is that the smaller, dynamic batch actually consumes more memory over the whole training run than the constant, larger batch. It also produces a CUDA out of memory error at some point in training (during a backward call). The green line in the figure corresponds to memory consumption for the smaller, dynamic batch; the grey curve is for the larger, constant-size batch. I would expect the opposite. What is going on? How do I even debug this kind of error?

I am training with pytorch-lightning, pytorch==1.5.0, and amp O2 (the CUDA memory problem also shows up without amp).
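To make the two batching methods concrete, here is a minimal sketch of what they amount to. This is not the actual training code: the function names, `PAD_ID = 0`, and the toy token ids are assumptions for illustration, and conversion to tensors is left out.

```python
MAX_LEN = 510  # fixed length used by the constant-padding variant
PAD_ID = 0     # assumed padding token id (hypothetical)


def pad_to(seqs, length, pad_id=PAD_ID):
    """Truncate each sequence to `length`, then right-pad it with `pad_id`."""
    padded = []
    for seq in seqs:
        seq = seq[:length]
        padded.append(seq + [pad_id] * (length - len(seq)))
    return padded


def collate_dynamic(seqs):
    """Method 1: pad only up to the longest sequence in this batch."""
    longest = min(max(len(s) for s in seqs), MAX_LEN)
    return pad_to(seqs, longest)


def collate_fixed(seqs):
    """Method 2: always pad up to MAX_LEN tokens."""
    return pad_to(seqs, MAX_LEN)


batch = [[101, 7, 8, 102], [101, 9, 102]]  # toy token ids
dyn = collate_dynamic(batch)
fix = collate_fixed(batch)
print(len(dyn[0]), len(fix[0]))
```

With the dynamic variant the batch shape changes from step to step, while the fixed variant produces the same shape every time.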

Is the padding the only difference between these approaches?
If so, could you show a code snippet to demonstrate this behavior?
My first guess would be memory fragmentation, but the difference is way too large for that, and the memory should also be reused, so you shouldn’t run out of memory.
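One way to start narrowing this down is to log the allocator statistics after every training step, together with the batch shape. The sketch below is only a starting point under some assumptions: the helper name `cuda_memory_mb` is made up, and it assumes a PyTorch version that provides `torch.cuda.memory_reserved`.

```python
import torch


def cuda_memory_mb(device=None):
    """Return current CUDA allocator statistics in MB (zeros if no GPU is available)."""
    if not torch.cuda.is_available():
        return {"allocated": 0.0, "reserved": 0.0, "peak_allocated": 0.0}
    mb = 1024 ** 2
    return {
        # memory currently occupied by live tensors
        "allocated": torch.cuda.memory_allocated(device) / mb,
        # memory held by the caching allocator (live tensors plus cached blocks)
        "reserved": torch.cuda.memory_reserved(device) / mb,
        # highest value of "allocated" seen so far
        "peak_allocated": torch.cuda.max_memory_allocated(device) / mb,
    }


stats = cuda_memory_mb()
print(stats)
```

If `reserved` keeps growing while `allocated` stays roughly flat, that would point toward cached blocks that cannot be reused for the changing tensor shapes. If `allocated` itself grows, something is likely holding references to tensors across iterations. `torch.cuda.memory_summary()` prints a more detailed breakdown.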

Thanks for the answer, @ptrblck! Yes, padding is the only difference in this configuration. I am using the huggingface transformers library and noticed that the amount of padding actually changes the loss value slightly due to dropout. Anyway, I will try to prepare a full example using some public dataset (I can’t share my company’s data) and reach out to you.