How to compare compute time and batch size between serial and parallel GPU runs

Hi,
I have a question about comparing wall times between parallel and serial GPU runs. My code uses DistributedDataParallel (DDP) for GPU parallelization. I want to compare the wall time of a single-GPU run against a multi-GPU run, but I'm not sure how to account for batching in each scenario. For example, if the serial run uses a batch size of 10, and the DDP run on 2 GPUs uses a global batch size of 10 (split into 2 local batches of 5, one per GPU), can the wall times from each training routine be directly compared?

My confusion comes down to this: to compare against a serial run with a batch size of 10, should the 2-GPU run use a global batch size of 20 (split into 2 local batches of 10, one per GPU), or a global batch size of 10 (split into 2 local batches of 5)? Hopefully this question makes sense, and I'm happy to answer any questions.
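For concreteness, something like the sketch below is what I have in mind for timing each run (the helper name and loop details are illustrative, not my actual code; the `torch.cuda.synchronize()` calls are there so queued kernels don't leak across the clock):

```python
import time
import torch

def timed_epoch(model, loader, optimizer, loss_fn, device):
    # Make sure all queued GPU work is finished before starting and stopping
    # the clock; CUDA calls are asynchronous, so without a sync the measured
    # wall time can miss in-flight kernels.
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize(device)
    return time.perf_counter() - start
```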

Usually you would scale the global batch size with the number of GPUs, i.e. if you double the number of GPUs you would also double the global batch size, which keeps the local (per-GPU) batch size the same. In your example that would mean a global batch size of 20 on 2 GPUs (10 per GPU), so each GPU does the same amount of work per step as the serial run with a batch size of 10.
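As a minimal sketch, the data loading could be set up so the local batch size stays fixed while the global batch size scales with the world size (the constant and helper function here are illustrative, not from your code):

```python
import torch
import torch.distributed as dist

LOCAL_BATCH_SIZE = 10  # per-GPU batch size, kept fixed across runs

def make_loader(dataset):
    # world_size is 1 for the serial run, 2 for the 2-GPU DDP run.
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    global_batch_size = LOCAL_BATCH_SIZE * world_size  # 10 -> 20 on 2 GPUs

    # DistributedSampler gives each rank a disjoint shard of the dataset.
    sampler = (
        torch.utils.data.distributed.DistributedSampler(dataset)
        if dist.is_initialized()
        else None
    )
    # Each process loads LOCAL_BATCH_SIZE samples per step, so the global
    # batch size grows with the number of GPUs while the per-step work on
    # each GPU stays comparable to the serial run.
    return torch.utils.data.DataLoader(
        dataset,
        batch_size=LOCAL_BATCH_SIZE,
        sampler=sampler,
        shuffle=(sampler is None),
    )
```

With this setup the per-step wall times are directly comparable, and the 2-GPU run should also finish an epoch in roughly half the number of steps, since each step consumes twice as many samples.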