Apex amp.DistributedDataParallel

I want to switch from torch DDP to apex amp DDP to make use of mixed precision. I’m training two DDP models, so I’m using two process groups to make sure the gradients are synchronized correctly. Is there a way to pass a process group to amp DDP to get the same performance?

You don’t need to switch to apex/DDP to use automatic mixed precision.
We recommend using the PyTorch implementation of DDP together with the native mixed-precision implementation via torch.cuda.amp.
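
For illustration, here is a minimal sketch of that setup under a few assumptions: the two tiny Linear models and `loader` are placeholders standing in for your models and DataLoader, and the two dedicated process groups mirror your two-model setup. Note that torch.nn.parallel.DistributedDataParallel itself accepts a process_group argument, so each model can synchronize its gradients over its own group:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# new_group() without arguments spans all ranks; a dedicated group per
# model keeps the gradient all-reduces of the two models separate.
group_a = dist.new_group()
group_b = dist.new_group()
# Placeholder models; replace with your own.
model_a = DDP(torch.nn.Linear(32, 64).cuda(),
              device_ids=[local_rank], process_group=group_a)
model_b = DDP(torch.nn.Linear(64, 10).cuda(),
              device_ids=[local_rank], process_group=group_b)

optimizer = torch.optim.SGD(
    list(model_a.parameters()) + list(model_b.parameters()), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for data, target in loader:  # `loader` is a placeholder DataLoader
    data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # forward pass runs in mixed precision
        output = model_b(model_a(data))
        loss = torch.nn.functional.cross_entropy(output, target)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)         # unscales grads, skips step on inf/nan
    scaler.update()
```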

Oh, I see. Thanks a lot! I didn’t know about torch.cuda.amp.
It works with amp, and I am now able to fit a batch_size twice as big on the GPU.
One question though: I’m actually experiencing a slowdown with the increase of batch_size.
Is this expected behaviour? I was hoping to keep the same step execution time with a bigger batch_size, so that each epoch finishes faster.

It depends a bit on where the current bottleneck is.
Using amp should reduce the execution time if TensorCores can be used. However, increasing the batch size will also increase the step time again. The net benefit depends on the speedup achieved through the use of TensorCores vs. the increased workload.
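To make that concrete with purely hypothetical numbers: if a step takes 100 ms at batch size 32 and 150 ms at batch size 64 after enabling amp, the per-step time goes up by 1.5x, but the throughput still improves from 320 samples/s to roughly 427 samples/s.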
That being said, increasing the batch size also increases the data loading time, since each worker now has to load more samples, so your current setup might be facing a data loading bottleneck.
You could profile the data loading as shown in the ImageNet example and check if this time decreases during training, which would mean that all workers can preload the batches in the background while the GPU is busy with training.
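
For reference, here is a rough sketch of the timing pattern used in the ImageNet example (the variable names are illustrative, and `loader` stands for your DataLoader):

```python
import time

# Measure how long each iteration waits for the next batch vs. how long
# the whole iteration takes. If the waiting time drops towards zero after
# the first iterations, the DataLoader workers keep up and data loading
# is not the bottleneck. Note that CUDA ops are asynchronous, so these
# host-side timings are only a rough signal for the compute part.
end = time.time()
for i, (data, target) in enumerate(loader):
    data_time = time.time() - end  # time spent waiting for the batch

    # ... forward / backward / optimizer step goes here ...

    iter_time = time.time() - end  # total iteration time (incl. data_time)
    end = time.time()
    if i % 10 == 0:
        print(f"iter {i}: data {data_time:.3f}s, total {iter_time:.3f}s")
```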

I did the profiling as you advised, and the dataloader is not the issue here (the time is almost the same).
But I noticed that the per-step slowdown is still a speedup in terms of throughput: it now takes about 1.2x less time to process the same number of samples as before. So I guess that’s how it’s supposed to be. Thanks a lot for your help!