Since you have multiple GPUs, you could use nn.DataParallel to utilize all or some of them.
Have a look at this tutorial to apply it.
Basically your batch will be split into chunks in the batch dimension and pushed to all specified devices.
Also to speed up the data loading you should use multiprocessing in your DataLoader by setting num_workers>0.
From the tutorial, I understand that I need to Parallelize my model before moving it to device.
Since if I move it model.to(device) it will by default copy it to just 1 GPU regardless how many GPUs are available in torch.cuda.device_count(). Is this right?
I am using the dataloader as follows. Is the implementation correct?
The gradients will be reduced to the GPU you are specifying, so you might see a slightly increased memory usage on this device.
The DataLoaders look good. Since you are using GPUs, you should also set pin_memory=True to use the pinned host memory as the GPU cannot access data directly from pageable host memory.