Modifying the DataLoader to balance load across different GPUs

Hi, I am trying to figure out how to modify the DataLoader to allow a different batch size per device. I need this because the machine has a mix of GPUs with different memory capacities and tensor core counts.

I have been looking at the implementation of the DataLoader, and it seems the appropriate thing to do would be to pass in a custom batch_sampler. A LoadBalancedBatchSampler class would inspect the rank of the process (similar to what DistributedSampler does) and pick the batch_size for that rank before the yield loop.
DistributedSampler would also have to be modified so that every device performs the same number of iterations per epoch; I have sketched one way to handle both concerns below.
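
For concreteness, here is a rough sketch of what I have in mind. The class name, the batch_sizes argument, and the proportional sharding are my own invention, not an existing PyTorch API. It assumes every rank can see the whole dataset, and it replaces DistributedSampler rather than modifying it: each global step consumes sum(batch_sizes) samples, so every rank gets floor(len(dataset) / sum(batch_sizes)) steps by construction.

```python
import torch
import torch.distributed as dist
from torch.utils.data import Sampler


class LoadBalancedBatchSampler(Sampler):
    """Yield index batches whose size depends on the current rank.

    batch_sizes[r] is the batch size rank r's GPU can handle. Every rank
    performs the same number of steps per epoch, so collective ops
    (e.g. the gradient all-reduce) stay in sync.
    """

    def __init__(self, dataset_len, batch_sizes, rank=None, shuffle=True, seed=0):
        self.dataset_len = dataset_len
        self.batch_sizes = list(batch_sizes)
        self.rank = dist.get_rank() if rank is None else rank
        self.shuffle = shuffle
        self.seed = seed
        self.epoch = 0
        # Each global step consumes sum(batch_sizes) samples, so the step
        # count comes out identical on every rank.
        self.num_batches = dataset_len // sum(self.batch_sizes)

    def __iter__(self):
        if self.shuffle:
            # Seed with the epoch so all ranks shuffle identically.
            g = torch.Generator()
            g.manual_seed(self.seed + self.epoch)
            indices = torch.randperm(self.dataset_len, generator=g).tolist()
        else:
            indices = list(range(self.dataset_len))

        # This rank's slice of each global batch starts after the samples
        # taken by the lower-numbered ranks.
        offset = sum(self.batch_sizes[:self.rank])
        stride = sum(self.batch_sizes)
        bs = self.batch_sizes[self.rank]
        for step in range(self.num_batches):
            start = step * stride + offset
            yield indices[start:start + bs]

    def __len__(self):
        return self.num_batches

    def set_epoch(self, epoch):
        # Mirrors DistributedSampler.set_epoch for per-epoch reshuffling.
        self.epoch = epoch
```

With the stock DataLoader I would then use it something like this:

```python
# e.g. rank 0 has the larger GPU, rank 1 the smaller one
sampler = LoadBalancedBatchSampler(len(dataset), batch_sizes=[64, 32])
loader = torch.utils.data.DataLoader(dataset, batch_sampler=sampler, num_workers=4)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)
    for batch in loader:
        ...
```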

Am I moving in the right direction? Have I missed anything?

Thank you in advance.

cc @VitalyFedyunin for data loader questions