How to write training loop for MaskRCNN Distributed RPC

I am trying to train Mask R-CNN (the class from Torchvision) on 3 machines (master, worker-1 and worker-2), each with multiple GPUs, using distributed RPC with model parallelism. So far I have been able to divide the whole architecture across 2 machines (manually). I have used the following resources extensively to do this,

Finetuning MaskRCNN in general: TorchVision Object Detection Finetuning Tutorial — PyTorch Tutorials 1.7.1 documentation
Distributed RPC: Distributed Pipeline Parallelism Using RPC — PyTorch Tutorials 1.7.1 documentation
Torchvision github

Now the only thing left is to write the training loop, and this is where I am stuck. I have looked at the training loop used in the detection reference code, and I need to convert it into a distributed training loop. How should I edit https://github.com/pytorch/vision/tree/master/references/detection/engine.py (which contains the training loop for training Mask R-CNN, as linked previously) to achieve this? A training loop for distributed RPC is also given in the example mentioned above, but for a classification task. How can I combine these two to train Mask R-CNN in a distributed way (with model parallelism)?
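
For reference, this is (roughly) the part of train_one_epoch in engine.py that needs to change; everything else around it (metric logging, the warm-up LR scheduler, reduce_dict) is bookkeeping. Condensed for clarity:

    for images, targets in data_loader:
        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        # in training mode the model returns a dict of losses
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()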

EDIT:

Question-2:

The second shard of the model, which resides on the second machine, has 2 modules. Each of them is allocated 1 GPU on that machine (meaning module-1 on GPU-1 and module-2 on GPU-2). I have 2 other GPUs on the same machine which are idle. Dividing these 2 modules equally across 4 GPUs would mean overriding their forward methods, which are very complicated (I am talking about the RPN and RoI Heads modules). Can I do something like this instead: put module-1 on GPU-1 & GPU-2 and module-2 on GPU-3 & GPU-4, and when an input batch comes, split it equally into 2 parts and process them as part1 → module-1 (GPU-1) → module-2 (GPU-3) and part2 → module-1 (GPU-2) → module-2 (GPU-4)? Of course, these 2 workflows would run concurrently; at the end their losses would be averaged and the average loss returned to the master node (see the sketch after this question). I have already referred to the following articles,

https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html

https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html

Is it possible to implement the case explained above in PyTorch?
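
Here is a rough, purely local sketch of the data flow described in Question-2, assuming for clarity that the two modules can be treated as tensor-in/tensor-out blocks (the real RPN and RoI Heads take richer inputs) and that the input is a single batch tensor. TwoWayShard and the device assignments are hypothetical, and the two copies of each module would still need to be kept in sync (e.g. by averaging their gradients) for this to behave like a single model:

    import copy
    from concurrent.futures import ThreadPoolExecutor

    import torch
    import torch.nn as nn

    class TwoWayShard(nn.Module):
        """Replicate the two shard-2 modules and run two halves of a batch concurrently."""

        def __init__(self, module1: nn.Module, module2: nn.Module):
            super().__init__()
            # replica A: module-1 on GPU-1, module-2 on GPU-3
            self.m1_a = copy.deepcopy(module1).to("cuda:0")
            self.m2_a = copy.deepcopy(module2).to("cuda:2")
            # replica B: module-1 on GPU-2, module-2 on GPU-4
            self.m1_b = copy.deepcopy(module1).to("cuda:1")
            self.m2_b = copy.deepcopy(module2).to("cuda:3")
            self._pool = ThreadPoolExecutor(max_workers=2)

        @staticmethod
        def _pipeline(x, m1, m2, dev1, dev2):
            # one half of the batch: module-1 on dev1, then module-2 on dev2
            out = m1(x.to(dev1))
            return m2(out.to(dev2))

        def forward(self, batch):
            half_a, half_b = torch.chunk(batch, 2, dim=0)
            fut_a = self._pool.submit(self._pipeline, half_a, self.m1_a, self.m2_a, "cuda:0", "cuda:2")
            fut_b = self._pool.submit(self._pipeline, half_b, self.m1_b, self.m2_b, "cuda:1", "cuda:3")
            loss_a = fut_a.result()
            loss_b = fut_b.result()
            # average the per-half losses before returning them to the master
            return (loss_a.to("cuda:0") + loss_b.to("cuda:0")) / 2

Whether this actually helps depends on how much of RPN/RoI Heads time is spent in Python on the CPU: the thread pool lets GPU work on the two replicas overlap, but the GIL still serializes the Python parts of the two forward passes.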

To be clear, since you already know how to split the model into two model shards, is your question how to convert the following lines in https://github.com/pytorch/vision/blob/master/references/detection/engine.py:

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

into the code in the RPC Framework Tutorial?

        with dist_autograd.context() as context_id:
            outputs = model(inputs)
            dist_autograd.backward(context_id, [loss_fn(outputs, labels)])
            opt.step(context_id)
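
To make that concrete for the sharded setup: opt in the snippet above is a DistributedOptimizer built from RRefs to the parameters of both shards, and model is a wrapper on the master that forwards through the shards via RPC. A minimal sketch, assuming a hypothetical DistMaskRCNN wrapper that follows the same pattern as DistResNet50 in the tutorial:

    import torch
    import torch.distributed.autograd as dist_autograd
    from torch.distributed.optim import DistributedOptimizer

    # `model` is a hypothetical DistMaskRCNN wrapper holding RRefs to the two
    # shards; like DistResNet50 in the tutorial, it gathers parameter RRefs:
    #
    #     def parameter_rrefs(self):
    #         remote_params = []
    #         remote_params.extend(self.shard1_rref.remote().parameter_rrefs().to_here())
    #         remote_params.extend(self.shard2_rref.remote().parameter_rrefs().to_here())
    #         return remote_params

    opt = DistributedOptimizer(
        torch.optim.SGD,
        model.parameter_rrefs(),
        lr=0.005,
        momentum=0.9,
    )

    for images, targets in data_loader:
        with dist_autograd.context() as context_id:
            # the forward pass crosses both shards via RPC
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
            # replaces optimizer.zero_grad() + losses.backward(); gradients are
            # accumulated per context on the workers that own the parameters
            dist_autograd.backward(context_id, [losses])
            # replaces optimizer.step(); updates happen on the owning workers
            opt.step(context_id)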

cc: @mrshenli

I know about the conversion you mentioned in your answer, but what should be done with these lines in engine.py? Should I keep them as they are?

    # reduce losses over all GPUs for logging purposes
    loss_dict_reduced = utils.reduce_dict(loss_dict)
    losses_reduced = sum(loss for loss in loss_dict_reduced.values())

I have added another question; please take a look at that. Thanks for answering my questions.

> I know about the conversion you mentioned in your answer, but what should be done with these lines in engine.py? Should I keep them as they are?

Hey @matrix, reduce_dict internally uses collective communication. If you need that, you will need to set up the process group with init_process_group in such a way that only the processes that hold a loss_dict form a gang.

However, since you already use RPC, do you still need reduce_dict? Would it work if you send the losses to the same process, reduce them locally there (without using collective communication), and then launch the distributed backward pass from there?
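
A minimal sketch of that alternative, assuming (hypothetically) that the RPC forward pass returns one loss dict per pipeline to the master, so a plain Python reduction replaces utils.reduce_dict:

    import torch.distributed.autograd as dist_autograd

    with dist_autograd.context() as context_id:
        # hypothetical return format: one loss dict per pipeline half,
        # already materialized on the master process
        loss_dicts = model(images, targets)

        # local reduction on the master replaces utils.reduce_dict; no
        # init_process_group / collective communication is needed
        loss_dict_reduced = {
            k: sum(d[k] for d in loss_dicts) / len(loss_dicts)
            for k in loss_dicts[0]
        }
        losses = sum(loss for loss in loss_dict_reduced.values())

        dist_autograd.backward(context_id, [losses])
        opt.step(context_id)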