Hi, my understanding is that DataParallel currently splits a large batch into smaller batches evenly (i.e., each worker receives the same number of examples). I wonder if it is possible to let each GPU get a different number of examples.
The motivation is that the synchronous reduce has to wait for all GPUs to finish their work on each batch. If I have GPUs with very different speeds (e.g., a Titan V and a 1070), the fast GPU has to sit idle waiting for the slow one to finish.
To minimize the synchronization time, I want to set a smaller batch size on the 1070 so it finishes its batch faster. Furthermore, it would be great if some algorithm could adjust the batch sizes automatically (e.g., if one worker takes longer to finish, allocate fewer examples to it and send more to the faster workers).
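To make the idea concrete, here is a minimal sketch of the kind of heuristic I have in mind: split the batch in proportion to each worker's measured throughput (inverse of its last step time). The function name and the timing numbers are my own illustration, not an existing PyTorch API; presumably the resulting sizes could then be fed into a custom scatter step.

```python
def proportional_chunk_sizes(total_batch, times_per_batch):
    """Split `total_batch` examples across workers whose last step took
    `times_per_batch[i]` seconds; faster workers receive more examples."""
    speeds = [1.0 / t for t in times_per_batch]  # throughput proxy per worker
    total_speed = sum(speeds)
    sizes = [int(total_batch * s / total_speed) for s in speeds]
    # Hand out any examples lost to integer truncation, one at a time.
    remainder = total_batch - sum(sizes)
    for i in range(remainder):
        sizes[i % len(sizes)] += 1
    return sizes

# Hypothetical timings: a Titan V at 0.05 s/batch and a 1070 at 0.15 s/batch
print(proportional_chunk_sizes(64, [0.05, 0.15]))  # → [48, 16]
```

With sizes like these, both GPUs would take roughly the same wall-clock time per step, so neither waits long on the other during the reduce.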