Time to transfer data to GPU doubled in distributed training

I measured the execution time of these two lines in engine/trainer.py#L64, which I refer to as the "todevice" time:

        images = images.to(device)
        targets = [target.to(device) for target in targets]
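Roughly, the measurement is just wall-clock timers around those two lines inside the training loop. A minimal sketch of what I mean by "todevice" time (the torch.cuda.synchronize() calls here are added only to make the measured interval unambiguous; they are not part of engine/trainer.py):

    import time
    import torch

    # Synchronize before and after the copy so the measured interval covers
    # only the host-to-device transfer, not earlier asynchronous CUDA work.
    torch.cuda.synchronize()
    start = time.time()
    images = images.to(device)
    targets = [target.to(device) for target in targets]
    torch.cuda.synchronize()
    todevice_time = time.time() - start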

I used 2 nodes, each with 8 GPUs, and each GPU processes 2 images. I ran the following command on the first host (on the second host, only --node_rank=1 changes):

export NGPUS=8
python -m torch.distributed.launch --nproc_per_node=$NGPUS \
--nnodes=2 --node_rank=0 --master_addr="172.17.61.2" --master_port=22876 \
tools/train_net.py --config-file "configs/e2e_faster_rcnn_R_50_FPN_1x.yaml" MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000 SOLVER.IMS_PER_BATCH 32 SOLVER.BASE_LR 0.04 SOLVER.STEPS "(30000, 40000)" SOLVER.MAX_ITER 50000 TEST.IMS_PER_BATCH 16 OUTPUT_DIR models/tmp-2n8g

The todevice time with 2 nodes (8 GPUs each) is double that of 1 node with 8 GPUs. I also measured other timings such as data_time, backbone time, RPN time, backward time, and step() (parameter update) time. All of these are very close to the single-node 8-GPU case.

I also tested 2 nodes with 16 GPUs each. The result is the same: the todevice time is twice that of a single node with 16 GPUs, with each GPU processing 2 images.

I am very confused. Each GPU processes the same number of images in both situations, yet the time increases in distributed mode.

That is weird indeed. Can you isolate the problem to data loading (i.e., don't train a model, just iterate over the dataset)? Due to the asynchronous nature of CUDA, wall-clock time that appears to be spent on data transfer can in fact be caused by the asynchronous execution of, for example, autograd, the optimizer, etc.
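A data-loading-only loop could look roughly like this (a sketch; it assumes the loader is built via this repo's make_data_loader and that the script is launched under the same torch.distributed.launch command as above, since the distributed sampler needs the process group to be initialized):

    import time
    import torch

    # Assumes the loader is built the same way as in tools/train_net.py;
    # adjust the config handling to match the actual setup.
    from maskrcnn_benchmark.config import cfg
    from maskrcnn_benchmark.data import make_data_loader

    cfg.merge_from_file("configs/e2e_faster_rcnn_R_50_FPN_1x.yaml")
    data_loader = make_data_loader(cfg, is_train=True, is_distributed=True)

    device = torch.device("cuda")
    data_end = time.time()
    for iteration, (images, targets, _) in enumerate(data_loader):
        data_time = time.time() - data_end

        # No model forward/backward and no optimizer step: only data loading
        # and the host-to-device copy. Synchronize around the copy so pending
        # asynchronous CUDA work is not attributed to the transfer.
        torch.cuda.synchronize()
        copy_start = time.time()
        images = images.to(device)
        targets = [target.to(device) for target in targets]
        torch.cuda.synchronize()
        todevice_time = time.time() - copy_start

        if iteration % 20 == 0:
            print("iter %d: data %.4fs, todevice %.4fs"
                  % (iteration, data_time, todevice_time))
        data_end = time.time()

If the todevice time still doubles with this loop, the issue really is in loading/transfer; if it doesn't, the extra time in the full training run is most likely other asynchronous work being folded into the copy measurement.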