Torch.distributed.launch module and data batch_size

I am trying to implement multi-process training. Since I am on Python 2.7, torch.multiprocessing.spawn is not available, so I use torch.distributed.launch to distribute my training program. However, I ran into several problems when using the launch module:
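For reference, here is roughly how I launch and initialize the processes (a simplified sketch; `train.py`, the backend, and the number of GPUs are placeholders, not my exact setup):

```python
# Launch command (one process per GPU on a single node):
#   python -m torch.distributed.launch --nproc_per_node=4 train.py

# Inside train.py -- torch.distributed.launch passes --local_rank to each process.
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

# Bind this process to its GPU and join the process group.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')
```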

  1. The batch_size used in the DataLoader seems to be the batch size for each GPU.
  2. The processes do not seem to communicate with each other. Each GPU appears to do its own batch separation over the full dataset, because the total number of batches becomes len(dataset) / (batch_size for one GPU); see the sketch after this list.
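To make point 2 concrete, here is a minimal sketch of the kind of DataLoader I have now (the dataset is a dummy placeholder standing in for my real one). A plain DataLoader like this hands every process the full dataset, which matches the batch count I observe:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for my real one (placeholder sizes).
dataset = TensorDataset(torch.randn(1000, 3, 32, 32),
                        torch.zeros(1000).long())

# Without a DistributedSampler, every process iterates the whole dataset,
# so each GPU runs ceil(len(dataset) / batch_size) iterations on its own.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
print(len(loader))  # ceil(1000 / 32) = 32 batches per process
```

My understanding is that torch.utils.data.distributed.DistributedSampler would instead split the indices across ranks, so each process would see roughly len(dataset) / (world_size * batch_size) batches, but I am not sure whether that is the intended way to combine it with torch.distributed.launch.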

Do I need to call all_reduce myself to gather and average the gradients across GPUs?
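To be concrete, this is the kind of manual gradient averaging I have in mind (a sketch only; `model` is a placeholder, and I am unsure whether this would be redundant if the model is wrapped in torch.nn.parallel.DistributedDataParallel):

```python
import torch.distributed as dist

def average_gradients(model):
    # All-reduce each parameter's gradient and divide by the world size,
    # so every process ends up with the same averaged gradients.
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size

# Intended usage per iteration:
#   loss.backward(); average_gradients(model); optimizer.step()
```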
PS: I am testing on a single node.