Thank you for the instant reply!
My code is not small enough to be shared easily:
It's a transformer NMT model. (In the above case, the model was wrapped in data parallel across 4 GPUs.)
I am worried the fragmentation issue is not deterministically reproducible from my code alone, because it mainly depends on the dataset.
I would need to package a small dataset and manually seed the RNGs to make it reproducible. On smaller datasets it is less of an issue, since there is less variance (described below). I will follow up once I have all of that in place and can reproduce it on your machine.
Let me describe my situation:
The reason for such bad fragmentation, I think, is that the training data is text with unequal sentence lengths. Each batch can have its own sequence length with padding, and we shuffle the batches, so there can be a large variance in memory requirements between any two consecutive batches.
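To make that variance concrete, here is a minimal sketch (not my actual data pipeline; sentence lengths and batch size are made up) of how padding each shuffled batch to its own longest sentence gives every batch a different tensor shape, and therefore a different-sized CUDA allocation:

```python
import random

import torch
from torch.nn.utils.rnn import pad_sequence

random.seed(0)
torch.manual_seed(0)

# Hypothetical corpus: 1,000 "sentences" whose lengths range from 5 to 400 tokens.
sentences = [torch.randint(1, 32000, (random.randint(5, 400),)) for _ in range(1000)]

# Fixed number of sentences per batch; batch order is shuffled every epoch.
B = 64
batches = [sentences[i:i + B] for i in range(0, len(sentences), B)]
random.shuffle(batches)

for batch in batches[:5]:
    # Each batch is padded to the length of its own longest sentence, so L
    # (and the size of the CUDA allocation backing the tensor) differs from
    # batch to batch.
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    print(padded.shape)  # e.g. torch.Size([64, 383]), torch.Size([64, 221]), ...
```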
We improved the situation with
batch_size = B x L, where
B = #sentences in the batch and L = #tokens per sentence, including padding.
B x L is capped at batch_size, but both B and L can vary individually between any two consecutive batches, e.g. for B x L = 4096 and for the model of dim
In an extreme case, we are seeing tensors of shape
But in the average case, due to the variance in lengths, B x L is NOT guaranteed to be exactly 4096; it can be a few tokens lower, say 10 x 409 = 4090 for example. Hence there are still small fluctuations (spikes and dips) in memory usage across batches. I suspect these fluctuations lead to fragmentation in the long run, eventually causing CUDA OOM.
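For illustration only (this is not my actual batcher, and the length distribution is made up), here is a minimal sketch of the B x L budget described above: sentences are grouped so that B times the longest length in the batch stays under a 4096-token budget, and, as in the 10 x 409 example, the product usually lands slightly below the budget rather than hitting it exactly:

```python
import random

random.seed(0)

TOKEN_BUDGET = 4096  # cap on B x L per batch

# Hypothetical sentence lengths (tokens per sentence, before padding).
lengths = sorted(random.randint(5, 400) for _ in range(10_000))

batches, current = [], []
for n in lengths:
    # Close the batch if adding this sentence would push B x L_max over the budget.
    if current and (len(current) + 1) * max(max(current), n) > TOKEN_BUDGET:
        batches.append(current)
        current = []
    current.append(n)
if current:
    batches.append(current)

for b in batches[:5]:
    B, L = len(b), max(b)
    # B x L stays <= 4096 but rarely equals it, e.g. 10 x 409 = 4090,
    # so the memory footprint still fluctuates slightly from batch to batch.
    print(f"B={B:4d}  L={L:4d}  B*L={B * L}")
```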
With this new information, is there any further advice you could give me to improve the fragmentation situation?