Hi, I am trying to figure out how to modify the DataLoader to permit a different batch size per device. I need to do this as I have different GPUs with different memory and tensor core sizes on the same machine.
I have been looking at the implementation of the DataLoader, and it seems the appropriate approach would be to use a custom batch_sampler. A LoadBalancedBatchSampler class would inspect the rank of the process (similar to what DistributedSampler does) and then set a per-rank batch_size before the yield loop.
DistributedSampler would also have to be modified so that every device runs the same number of iterations per epoch (otherwise ranks would desynchronize at collective operations).
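To make the idea concrete, here is a minimal sketch of what I have in mind. The class name and constructor arguments are hypothetical (nothing like this exists in torch); it shards the index space proportionally to each rank's batch size, so all ranks exhaust their shard after the same number of batches, which also covers the equal-iteration-count concern. Since DataLoader accepts any iterable for batch_sampler, it does not even need to subclass torch.utils.data.Sampler:

```python
class LoadBalancedBatchSampler:
    """Hypothetical per-rank batch sampler (not part of torch).

    Each rank draws batches of its own size from its own shard of the
    dataset, and every rank yields the same number of batches so that
    collective ops stay in sync.
    """

    def __init__(self, dataset_len, batch_sizes, rank):
        # batch_sizes: one entry per rank, e.g. sized to each GPU's memory
        self.batch_sizes = batch_sizes
        self.rank = rank
        total = sum(batch_sizes)
        # Every rank performs the same number of iterations; leftover
        # samples that don't fill one round across all ranks are dropped.
        self.num_batches = dataset_len // total
        # Shard the index space proportionally to each rank's batch size.
        start = sum(batch_sizes[:rank]) * self.num_batches
        self.indices = list(
            range(start, start + batch_sizes[rank] * self.num_batches)
        )

    def __iter__(self):
        bs = self.batch_sizes[self.rank]
        for i in range(self.num_batches):
            yield self.indices[i * bs:(i + 1) * bs]

    def __len__(self):
        return self.num_batches
```

It would then be passed as `batch_sampler=LoadBalancedBatchSampler(len(dataset), [32, 16], rank)` to the DataLoader on each rank. Shuffling and epoch-dependent seeding (what DistributedSampler's set_epoch handles) are left out of the sketch.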
Am I moving in the right direction? Have I missed anything?
Thank you in advance.