I am using Hugging Face Transformers with PyTorch, and I have a model instance of the RobertaForMaskedLM class that I want to train with masked language modeling on a dataset of roughly 2M examples, with a max sequence length of 512 and a total batch size of 2048. I have 4 GPUs at my disposal, but I am unable to distribute the batches evenly across all of them. What I want is to send a batch of size 2048/4 = 512 to each of the 4 GPUs, for a cumulative batch size of 2048.
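For context, my setup looks roughly like this (the roberta-base checkpoint/tokenizer and the standard DataCollatorForLanguageModeling below are stand-ins for my actual config):

from torch.utils.data import DataLoader
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, DataCollatorForLanguageModeling

# Stand-in checkpoint; in my actual run I load my own config/weights.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Standard MLM collator (masks 15% of tokens); train_dataset is already tokenized to max length 512.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)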
I tried using model = nn.DataParallel(model.cuda()), but I get a CUDA out-of-memory error for every batch size I try (even as low as 8). I am creating the dataloader as follows:
train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    batch_size=data_args.train_batch_size,
    collate_fn=data_collator,
    pin_memory=True,
    num_workers=4,
)
where data_args.train_batch_size = 2048.
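And the training loop where I eventually hit the CUDA OOM looks roughly like this (the optimizer settings are placeholders, and scheduler/logging details are trimmed):

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder hyperparameters

model.train()
for batch in train_dataloader:
    # Move the full batch to the default device; nn.DataParallel should then
    # scatter it along dim 0 across the 4 GPUs (2048 -> 4 x 512 per step).
    batch = {k: v.cuda() for k, v in batch.items()}
    outputs = model(**batch)
    # DataParallel gathers one loss value per replica, so reduce to a scalar.
    loss = outputs.loss.mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()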