How to train a model on multiple GPUs?

I am using Hugging Face Transformers with PyTorch. I have an instance of RobertaForMaskedLM that I want to train with masked language modeling on a dataset of roughly 2M examples, with a maximum sequence length of 512 and a total batch size of 2048. I have 4 GPUs available, but I have not been able to split each batch evenly across them. What I want is to send a batch of 2048/4 = 512 to each of the 4 GPUs, for a cumulative batch size of 2048.
I tried model = nn.DataParallel(model.cuda()), but I get a CUDA out-of-memory error for every batch size I try (even as low as 8). I create the dataloader as follows:

train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=data_args.train_batch_size, collate_fn=data_collator, pin_memory=True, num_workers=4)

where data_args.train_batch_size = 2048.
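For reference, the numbers implied by this setup can be worked out with a short sketch. The helper names below (plan_batches, mlm_logits_bytes) are hypothetical, the vocabulary size of 50265 is an assumption taken from the stock roberta-base checkpoint, and the per-step micro-batch of 8 is only an example value. Gradient accumulation is one common way to reach a large effective batch when 512 sequences per GPU do not fit in memory; it is not something the original post used.

```python
def plan_batches(total_batch: int, num_gpus: int, micro_batch: int) -> tuple[int, int]:
    """Return (per-GPU batch, gradient accumulation steps) for a target total batch."""
    if total_batch % num_gpus != 0:
        raise ValueError("total_batch must be divisible by num_gpus")
    per_gpu = total_batch // num_gpus
    if per_gpu % micro_batch != 0:
        raise ValueError("per-GPU batch must be divisible by micro_batch")
    # Each GPU processes `micro_batch` samples per forward/backward pass and
    # accumulates gradients over `accum_steps` passes before one optimizer step.
    accum_steps = per_gpu // micro_batch
    return per_gpu, accum_steps


def mlm_logits_bytes(batch: int, seq_len: int, vocab_size: int, bytes_per_value: int = 4) -> int:
    """Size of the MLM output logits tensor alone (fp32 by default)."""
    return batch * seq_len * vocab_size * bytes_per_value


if __name__ == "__main__":
    per_gpu, accum_steps = plan_batches(total_batch=2048, num_gpus=4, micro_batch=8)
    print(per_gpu, accum_steps)  # 512 64

    # Assumed vocab size (roberta-base); a custom tokenizer would change this.
    full = mlm_logits_bytes(batch=512, seq_len=512, vocab_size=50265)
    print(f"{full / 1e9:.1f} GB of logits for 512 sequences per GPU")
```

Under these assumptions the fp32 logits alone for 512 sequences of length 512 come to roughly 52.7 GB per GPU, before activations, weights, or optimizer state, so a much smaller per-step batch combined with accumulation is the more realistic target.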

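I would recommend using DistributedDataParallel (DDP). It runs one process per GPU, which avoids the overhead that nn.DataParallel incurs by replicating the model and scattering/gathering data from a single process on every step, and it should give you a better speedup.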
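To make the suggestion concrete, here is a minimal single-node sketch of a DDP training loop. It is not the poster's code: the nn.Linear model and the random TensorDataset are stand-ins for RobertaForMaskedLM and the real dataset, and the batch size and port are arbitrary assumptions. It falls back to the gloo backend on CPU so it can be tried without GPUs.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train(rank: int, world_size: int) -> float:
    """Run one epoch in one process; on a single node, rank is also the GPU index."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")  # arbitrary free port (assumption)
    use_cuda = torch.cuda.is_available()
    dist.init_process_group("nccl" if use_cuda else "gloo", rank=rank, world_size=world_size)
    device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")
    if use_cuda:
        torch.cuda.set_device(device)

    # Stand-in model; in the question this would be the RobertaForMaskedLM instance.
    model = nn.Linear(16, 2).to(device)
    ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)

    # Stand-in data; the sampler gives each process a disjoint shard of the dataset.
    dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    # batch_size here is per process, so the global batch is batch_size * world_size.
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    sampler.set_epoch(0)  # call with the epoch number so each epoch reshuffles
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()  # DDP averages gradients across processes during backward
        optimizer.step()

    dist.destroy_process_group()
    return loss.item()


if __name__ == "__main__":
    # torchrun sets LOCAL_RANK and WORLD_SIZE; the defaults allow a plain single-process run.
    train(int(os.environ.get("LOCAL_RANK", 0)), int(os.environ.get("WORLD_SIZE", 1)))
```

Saved as, say, train_ddp.py (a hypothetical file name), this could be launched on 4 GPUs with torchrun --nproc_per_node=4 train_ddp.py. Note that the DataLoader batch size is per process, so it should be the per-GPU value, not 2048.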

Could you please link me to a tutorial to do so?

This is an official tutorial.

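If DDP feels overwhelming, you can consider Hugging Face Accelerate (its Accelerator class), which sets up distributed training for you with only a few changes to a regular PyTorch loop.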