Hi, my understanding is that DataParallel currently splits a large batch into smaller batches evenly (i.e., each worker receives the same number of examples). I wonder if it is possible to let each GPU get a different number of examples.
The motivation is that the synchronous reduce has to wait for all GPUs to finish their work on each batch. If I have GPUs with very different speeds (e.g., a Titan V and a 1070), the fast GPU has to sit idle waiting for the slow one to finish.
To minimize the synchronization time, I want to set a smaller batch size on the 1070 so it finishes its batch faster. Furthermore, it would be great if some algorithm could adjust the batch sizes automatically (e.g., if one worker takes longer to finish, allocate fewer examples to it and send more to the faster workers).
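To make the idea concrete, here is a minimal sketch of the kind of heuristic I have in mind: split the batch in proportion to each worker's measured throughput (inverse of its last step time). The function name and the timing numbers are my own illustration, not an existing PyTorch API; presumably the resulting sizes could then be fed into a custom scatter step.

```python
def proportional_chunk_sizes(total_batch, times_per_batch):
    """Split `total_batch` examples across workers whose last step took
    `times_per_batch[i]` seconds; faster workers receive more examples."""
    speeds = [1.0 / t for t in times_per_batch]  # throughput proxy per worker
    total_speed = sum(speeds)
    sizes = [int(total_batch * s / total_speed) for s in speeds]
    # Hand out any examples lost to integer truncation, one at a time.
    remainder = total_batch - sum(sizes)
    for i in range(remainder):
        sizes[i % len(sizes)] += 1
    return sizes

# Hypothetical timings: a Titan V at 0.05 s/batch and a 1070 at 0.15 s/batch
print(proportional_chunk_sizes(64, [0.05, 0.15]))  # → [48, 16]
```

With sizes like these, both GPUs would take roughly the same wall-clock time per step, so neither waits long on the other during the reduce.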