I am trying to fine-tune a GPT-2 model, following the tutorial here:
At some point, the code uses
torch.nn.DataParallel(), which uses both of my GPUs. However, the GPU usage is imbalanced: for example, training occupies 22 GB on the first GPU but only 9 GB on the second.
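For context, the relevant step looks roughly like this (a minimal sketch, not the tutorial's exact code; the linear model and random batch are placeholders standing in for GPT-2 and the real training data):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the GPT-2 model from the tutorial.
model = nn.Linear(768, 768)

# DataParallel replicates the module to all visible GPUs and splits each
# input batch across them; with no GPUs it simply runs on a single device.
model = nn.DataParallel(model)

batch = torch.randn(8, 768)  # dummy input batch
out = model(batch)
print(out.shape)  # torch.Size([8, 768])
```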
Why is this happening? Why does the first GPU use 13 GB more VRAM? What is stored there that is not stored on the second GPU?