For multi-GPU training on a single machine, the function gather(outputs, target_device, dim=0) in pytorch/torch/nn/parallel/scatter_gather.py supports autograd.
For multi-machine distributed training, the function gather(tensor, **kwargs) in pytorch/torch/distributed/__init__.py does not support autograd.
Is my understanding correct?
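For context, here is a minimal CPU-only sketch of what I mean by "supports autograd". It uses torch.cat as a stand-in for the cross-device concatenation that nn.parallel.gather performs (as I understand it, the real gather requires CUDA tensors, so this is only an analogy, not the actual call):

```python
import torch

# CPU-only stand-in: torch.cat mimics the concatenation that
# nn.parallel.gather performs across device replicas. The point
# is that the operation is recorded on the autograd graph, so
# backward() propagates gradients to each input.
x = torch.ones(2, 3, requires_grad=True)
y = torch.ones(2, 3, requires_grad=True)

out = torch.cat([x, y], dim=0)   # autograd records this op
out.sum().backward()             # gradients flow back to x and y

print(out.grad_fn is not None)  # True: the op is on the autograd graph
print(x.grad is not None)       # True: backward reached the inputs
```

By contrast, my understanding is that calling torch.distributed.gather produces tensors with no grad_fn, so no gradients flow back through the communication.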