I measured the execution time of these two lines in engine/trainer.py#L64. I call it the todevice time:
images = images.to(device)
targets = [target.to(device) for target in targets]
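
For reference, here is a minimal sketch of how a timing like this could be taken. The helper `timed` and the `timings` dict are hypothetical names, not part of the repo. CUDA work is queued asynchronously, so a plain wall-clock timer around `.to(device)` may also include time spent waiting on earlier GPU work. Passing `torch.cuda.synchronize` as `sync` is one way to separate the two, assuming PyTorch with CUDA is available.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(name, results, sync=None):
    """Accumulate the wall-clock time of the wrapped block into results[name].

    sync is an optional callable invoked right before the timer starts and
    right before it stops (for example torch.cuda.synchronize), so that
    pending asynchronous work is not attributed to the wrong block.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        elapsed = time.perf_counter() - start
        results[name] = results.get(name, 0.0) + elapsed


# Hypothetical usage inside the training loop (requires torch and a GPU):
#
#   timings = {}
#   with timed("todevice", timings, sync=torch.cuda.synchronize):
#       images = images.to(device)
#       targets = [target.to(device) for target in targets]
```

Comparing the todevice number with and without `sync` on both the 1-node and 2-node runs would show whether the extra time is in the copy itself or in waiting for other work to finish.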
I used 2 nodes with 8 GPUs each, and each GPU processes 2 images. I ran this command on the first host (on the second host, only --node_rank=1 changes):
export NGPUS=8
python -m torch.distributed.launch --nproc_per_node=$NGPUS \
--nnodes=2 --node_rank=0 --master_addr="172.17.61.2" --master_port=22876 \
tools/train_net.py --config-file "configs/e2e_faster_rcnn_R_50_FPN_1x.yaml" MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000 SOLVER.IMS_PER_BATCH 32 SOLVER.BASE_LR 0.04 SOLVER.STEPS "(30000, 40000)" SOLVER.MAX_ITER 50000 TEST.IMS_PER_BATCH 16 OUTPUT_DIR models/tmp-2n8g
The todevice time with 2 nodes (8 GPUs each) is double that of 1 node with 8 GPUs. I also measured the other times: data_time, backbone time, rpn time, backward time, and the step() parameter-update time. All of these are very close to the values for one node with 8 GPUs.
I also tested 2 nodes with 16 GPUs each. The result is the same: the todevice time is twice that of one node with 16 GPUs, with each GPU processing 2 images.
I am very confused. Each GPU processes the same number of images in both situations, but the todevice time increases in the two-node distributed runs.