torch.nn.DataParallel(): Why do GPUs use different amounts of VRAM?

Hi all,

I am trying to fine-tune a GPT2 model, following the tutorial here:

At some point, the code uses torch.nn.DataParallel(), which uses both of my GPUs. However, the GPU usage is imbalanced: for example, training occupies 22GB on the first GPU but only 9GB on the second card.
Why is this happening? Why does the first GPU use 13GB more VRAM? What is stored there that is not stored on the second GPU?
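
For context, the relevant step looks roughly like this (the model class and names below are my own sketch, not the tutorial's exact code):

```python
import torch
from transformers import GPT2LMHeadModel  # assumed import; the tutorial may load the model differently

model = GPT2LMHeadModel.from_pretrained("gpt2")
model = torch.nn.DataParallel(model)  # splits each forward pass across both GPUs
model = model.to("cuda")              # parameters live on cuda:0; replicas are created per forward pass
```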

This imbalance is expected with nn.DataParallel: the parameters live on the default device, and the outputs of all replicas are gathered back onto it (usually together with the loss computation), so the first GPU holds tensors the second one never sees. This is one of the reasons I generally recommend DistributedDataParallel instead, the other being better performance. The underlying mechanics are explained in this blog post.
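
If you want to switch, a minimal DDP sketch could look like this (the model, optimizer, and data are placeholders for whatever the tutorial uses; launch it with e.g. `torchrun --nproc_per_node=2 train_ddp.py`):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # torchrun sets the rendezvous env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 10).cuda(local_rank)  # stand-in for the GPT2 model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                               # stand-in training loop
        inputs = torch.randn(8, 10, device=local_rank)
        loss = model(inputs).sum()                    # each rank computes its own loss locally
        optimizer.zero_grad()
        loss.backward()                               # gradients are all-reduced across ranks
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Since each process computes its loss locally and only gradients are all-reduced, both GPUs end up with a roughly equal memory footprint.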