Hi, yes I did get it to work in the end.
As @ptrblck mentioned, nn.DataParallel doesn't work the way you might be used to with multiple GPUs: in these object detection models the replica models are not independent, which is why nn.DistributedDataParallel is needed.
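In case it helps anyone else, the model-side change ends up being quite small. Here's a rough sketch of what I mean (fasterrcnn_resnet50_fpn and the variable names are just placeholders for whatever your script uses, and this assumes the script was started with the launch command shown below):

```python
import os

import torch
import torchvision

# the launcher starts one process per GPU; with --use_env the GPU index for
# this process is exposed via the LOCAL_RANK environment variable
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.to(local_rank)

# each process holds its own replica; gradients are all-reduced across
# processes during backward(), unlike nn.DataParallel's single-process
# scatter/gather of the model and its outputs
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```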
Also, for this to work, the script has to be launched in a specific way so that everything is set up properly for distributed training:
python -m torch.distributed.launch --nproc_per_node=2 --use_env name_of_your_training_script.py
I found this here, and there is also a GitHub discussion around some other questions I had. Launching the script this way ensures that each process is spawned correctly and that the processes are able to talk to each other (this is very similar to how distributed TensorFlow works).
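One thing that confused me at first is what the --use_env flag actually changes: as far as I understand, without it torch.distributed.launch appends a --local_rank argument to each process's command line, and with it only environment variables are set. A rough sketch of reading both (nothing here is specific to the detection scripts):

```python
import argparse
import os

parser = argparse.ArgumentParser()
# only used when launching WITHOUT --use_env: the launcher then appends
# --local_rank=<gpu index> to each process's command line
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# with --use_env the launcher sets environment variables instead
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))
rank = int(os.environ["RANK"])              # global rank of this process
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes
print(f"rank {rank}/{world_size}, local rank {local_rank}")
```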
If you are customising the example code (for example, using a different backbone), it's worth reading a lot of the boilerplate code in the example to understand how certain variables are set so that torch.distributed.launch works properly.
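For example, one of the things that boilerplate does is replace the usual shuffle=True with a DistributedSampler so that each process only sees its own shard of the data. A condensed sketch of that idea (dataset and num_epochs stand in for whatever your script defines; the real reference script wraps this in more machinery):

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def collate_fn(batch):
    # detection models take lists of images and targets, not stacked tensors
    return tuple(zip(*batch))

# 'dataset' is whatever detection dataset your script builds
train_sampler = DistributedSampler(dataset)

data_loader = DataLoader(
    dataset,
    batch_size=2,           # per-process batch size
    sampler=train_sampler,  # replaces shuffle=True from the single-GPU case
    num_workers=4,
    collate_fn=collate_fn,
)

for epoch in range(num_epochs):
    # makes each process draw a different, reshuffled shard every epoch
    train_sampler.set_epoch(epoch)
    # ... run the training loop for this epoch ...
```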
Additionally, if you want to use mixed precision training, you might find this useful.
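For what it's worth, with the native torch.cuda.amp API the training step looks roughly like this (model, optimizer, data_loader and local_rank are placeholders carried over from the sketches above; the linked resource may use a different approach such as NVIDIA apex):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for images, targets in data_loader:
    images = [img.to(local_rank) for img in images]
    targets = [{k: v.to(local_rank) for k, v in t.items()} for t in targets]

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        # detection models return a dict of losses in training mode
        loss_dict = model(images, targets)
        loss = sum(loss_dict.values())

    # scale the loss to avoid fp16 gradient underflow, then unscale before stepping
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```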