I am trying to implement multi-process training. Since I am on Python 2.7, torch.multiprocessing.spawn is not supported, so I use torch.distributed.launch to distribute my training program.
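Roughly, my setup looks like the minimal sketch below (the linear model is just a stand-in for my real network, and the argument names are only what my script happens to use):

```python
# Launched on a single node with something like:
#   python -m torch.distributed.launch --nproc_per_node=4 train.py
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each process it starts
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')

model = nn.Linear(128, 10).cuda()  # stand-in for my real network
model = nn.parallel.DistributedDataParallel(
    model, device_ids=[args.local_rank], output_device=args.local_rank)
```

However, I found several problems when using the launch module: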
- The batch_size used in the DataLoader seems to be the batch size for each GPU, not the total batch size.
- The processes do not seem to communicate with each other: each GPU appears to do its own batch separation over the full dataset, because the total number of batches per epoch ends up as len(dataset) / (batch_size for one GPU) instead of len(dataset) / (batch_size * num_gpus). My DataLoader is sketched after this list.
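To make the second point concrete, my DataLoader is built roughly like this (a sketch with a stand-in dataset, assuming a plain DataLoader without a DistributedSampler, which would match the behaviour above):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; my real one is much larger
dataset = TensorDataset(torch.randn(1000, 128), torch.randint(0, 10, (1000,)))

# batch_size here seems to act as the per-GPU batch size, and every process
# iterates len(dataset) / batch_size batches, i.e. the whole dataset.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
```

Do I need to wrap the dataset in a torch.utils.data.distributed.DistributedSampler so that each process only sees its own shard of the data?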
Should I implement all_reduce myself to gather and average the gradients across processes?
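To be clear about what I mean, something like the following, called after loss.backward() and before optimizer.step() (just a sketch of the idea):

```python
import torch.distributed as dist

def average_gradients(model):
    """Manually average gradients across all processes (what I have in mind)."""
    world_size = float(dist.get_world_size())
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad.data)  # default reduce op is SUM
            p.grad.data /= world_size
```

Or does DistributedDataParallel already synchronise the gradients during backward, so that this would be redundant?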
PS: I am testing on a single node.