Distributed Data Parallel slower than Data Parallel

The code snippet in this comment can serve as an example. Search for torch.cuda.Event.
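
In case the linked comment is not at hand, here is a minimal sketch of timing with torch.cuda.Event. It is not the snippet from that comment. The helper name `time_step`, its `warmup` and `iters` parameters, and the fallback to a wall-clock timer when CUDA (or PyTorch) is unavailable are all assumptions made for illustration.

```python
import time

try:
    import torch
except ImportError:  # assumption: lets the sketch run where PyTorch is missing
    torch = None


def time_step(fn, warmup=3, iters=10):
    """Return the mean time per call of fn, in milliseconds (hypothetical helper)."""
    use_cuda = torch is not None and torch.cuda.is_available()

    # Warm-up calls keep one-time costs (CUDA context creation, cuDNN
    # autotuning, allocator growth) out of the measurement.
    for _ in range(warmup):
        fn()

    if use_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()  # drain work queued before the timed region
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()  # wait until the `end` event has been reached
        return start.elapsed_time(end) / iters  # elapsed_time reports milliseconds

    # CPU-only fallback (assumption): plain wall-clock timing.
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1000.0 / iters


if __name__ == "__main__":
    # Stand-in workload; replace with one full training iteration.
    print(f"{time_step(lambda: sum(range(10_000))):.3f} ms per call")
```

To compare the two setups, `fn` would wrap one full training iteration (forward, backward, optimizer step), once with the DataParallel model and once with the DistributedDataParallel model. CUDA kernels launch asynchronously, so timing without the events or the synchronize calls tends to measure only the launch overhead.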