DistributedDataParallel on multiple GPU nodes slower than one GPU node

Two questions,

  1. Did you divide the epoch size on each process by world_size, i.e., is each rank only iterating over its own shard of the data? (See the sketch below.)
  2. Will there be any contention on the data loader (e.g., ranks on the same node competing for workers or disk I/O)?
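
For (1), here is a minimal sketch of what I mean, assuming a generic `dataset` object; the `batch_size` and `num_workers` values are just placeholders to tune. The sampler is the standard `torch.utils.data.distributed.DistributedSampler`:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_loader(dataset, batch_size=32):
    # DistributedSampler shards the dataset across ranks, so each process
    # iterates over roughly len(dataset) / world_size samples per epoch.
    # Without it, every rank reprocesses the full dataset, and a multi-node
    # run can easily end up slower than a single node.
    sampler = DistributedSampler(dataset)
    return DataLoader(
        dataset,
        batch_size=batch_size,  # per-process batch size
        sampler=sampler,        # mutually exclusive with shuffle=True
        num_workers=4,          # dedicated workers per rank (relates to (2))
        pin_memory=True,        # speeds up host-to-GPU copies
    )
```

One thing to remember with this setup: call `sampler.set_epoch(epoch)` at the start of each epoch so the shuffling order differs across epochs.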

cc @osalpekar

Also cc @zhangguanheng66 for transformer questions