How to distribute training across ALL GPUs

I have an object detection task. I am wrapping the model in torch.nn.DataParallel and passing torch.device('cuda') as the device during training. I see that one of the GPUs does most of the processing, while the other two only hold about 1-1.5 GB each. Here is the code that sets up the training:

model = torch.nn.DataParallel(model)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9, weight_decay=0.005)
num_epochs = 20
model, best_model = train_model(model=model,
                                optimizer=optimizer,
                                data_loader=data_loader_train,
                                device=torch.device('cuda'),
                                num_epochs=num_epochs)

I am using a batch size of 32. If I increase the batch size, the GPU runs out of memory, and at a batch size of 32 the workload is very unevenly distributed. Is there any way to distribute the workload evenly so that I can increase the batch size?

@Laya1 I would recommend using DistributedDataParallel (DistributedDataParallel — PyTorch 1.10 documentation) instead of torch.nn.DataParallel. You can find a tutorial for it here: Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.10.1+cu102 documentation
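
For a single machine with 3 GPUs, DDP normally runs one process per GPU, with each process handling its own slice of the data. A minimal sketch of that pattern (build_model and build_dataset are placeholders for your own model and dataset code, and the port number is arbitrary):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train_worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    # Every rank calls init_process_group with the same world_size;
    # the call blocks until all world_size processes have joined.
    dist.init_process_group(backend='nccl', world_size=world_size, rank=rank)
    torch.cuda.set_device(rank)

    model = build_model().to(rank)            # build_model(): placeholder for your model setup
    model = DDP(model, device_ids=[rank])

    dataset = build_dataset()                 # build_dataset(): placeholder for your dataset
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)  # per-process batch size

    optimizer = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad],
        lr=1e-3, momentum=0.9, weight_decay=0.005)

    for epoch in range(20):
        sampler.set_epoch(epoch)              # reshuffle differently each epoch
        for images, targets in loader:
            ...                               # your existing training step

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()    # 3 GPUs -> 3 processes
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size)

Because each process only sees its own shard of the batch, the memory use stays balanced across the GPUs and you can raise the per-process batch size as far as a single GPU allows.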

Thank you for the reply.
I tried to use DDP, but I cannot figure out the os.environ configuration. I have one computer with 3 GPUs, so I used the following settings based on a few posts online:

import os
import socket
from contextlib import closing
import torch.distributed
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ['MASTER_ADDR'] = '127.0.0.1'
# Pick a free port by binding an ephemeral socket and reading the port back.
with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(('', 0))
    temp = str(s.getsockname()[1])
os.environ['MASTER_PORT'] = temp
torch.distributed.init_process_group(backend='nccl', world_size=3, rank=0)
model = DDP(model)

The program hangs at init_process_group and never returns. I am not sure what is happening here or where I have made a mistake.
After this, I switched back to DataParallel, which at least runs, but with very different loads on the different GPUs. I just want to use all 3 GPUs, whichever way is possible. Thanks
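
One note on the snippet above: init_process_group with world_size=3 blocks until three processes have joined the group, so starting only a single rank-0 process will wait indefinitely. A minimal sketch of the alternative launch with torchrun (PyTorch 1.10+), which starts one process per GPU and sets MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK for each of them; the script name train_ddp.py and build_model are illustrative placeholders:

# Launch from a shell with:  torchrun --standalone --nproc_per_node=3 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='nccl')       # rank and world size come from env vars set by torchrun
local_rank = int(os.environ['LOCAL_RANK'])    # one process per GPU; LOCAL_RANK is set by torchrun
torch.cuda.set_device(local_rank)

model = build_model().to(local_rank)          # build_model(): placeholder for your model setup
model = DDP(model, device_ids=[local_rank])
# ... training loop with a DistributedSampler, as in the sketch above ...
dist.destroy_process_group()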