DataParallel imbalanced memory usage

We hit the same issue: when parallelizing, we could only train with a much smaller batch size.
Wrapping both the model and the loss in DistributedDataParallel gave us much better results. You do have to use DistributedSampler and init_process_group, but it’s all in this example: https://github.com/pytorch/examples/blob/master/imagenet/main.py
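For anyone who doesn’t want to dig through the full ImageNet example, here is a minimal sketch of the single-node setup we mean (one process per GPU with init_process_group, DistributedDataParallel and DistributedSampler). The model, dataset and hyperparameters are placeholders, not our actual training code:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def worker(rank, world_size):
    # One process per GPU; rendezvous settings are placeholders for a single machine.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Dummy model/dataset standing in for the real ones.
    model = DDP(torch.nn.Linear(128, 10).cuda(rank), device_ids=[rank])
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

    # DistributedSampler gives each process its own disjoint shard of the data.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        num_workers=4, pin_memory=True)

    criterion = torch.nn.CrossEntropyLoss().cuda(rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for x, y in loader:
            x = x.cuda(rank, non_blocking=True)
            y = y.cuda(rank, non_blocking=True)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()  # gradients are all-reduced across processes here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Because each process holds its own replica and only gradients are synchronized, the memory use stays balanced across GPUs, unlike DataParallel where GPU 0 accumulates the gathered outputs and the loss.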

However, we have not seen massive improvements in speed, probably because our dataloader/data transfer is slow, since our input size is quite large…
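If the input pipeline is the suspect, these are the DataLoader settings we would check first; the dataset and numbers below are placeholders, not our configuration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for a dataset with large inputs.
dataset = TensorDataset(torch.randn(4096, 3, 224, 224))

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=8,            # overlap loading/decoding with GPU compute
    pin_memory=True,          # page-locked host memory allows async H2D copies
    persistent_workers=True,  # avoid re-forking workers every epoch (PyTorch >= 1.7)
)

for (batch,) in loader:
    # non_blocking=True overlaps the copy with compute when memory is pinned
    batch = batch.cuda(non_blocking=True)
    ...  # forward/backward here
```

If the GPUs sit idle waiting for batches, no amount of data parallelism will shorten the epoch.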
Both methods, DistributedDataParallel and DataParallel, running on an AWS P3 with 8 GPUs barely improved at all compared to a single GPU (perhaps the variation in the time required for an epoch is reduced, but the average time is about the same). That doesn’t make much sense; has anyone seen the same problem?