Why does scatter_object_list not support NCCL?

Hi, I was reading the torch.distributed docs, and they say that scatter_object_list does not support the NCCL backend because tensor-based scatter is not supported. But dist.scatter does seem to support the NCCL backend. These two statements appear to conflict.

ref: Distributed communication package - torch.distributed — PyTorch 1.12 documentation

Could anyone explain why?

Thanks a lot!

I read the code, and it seems that the scatter_object_list function in torch/distributed/distributed_c10d.py does not move tensor_list to a device before the scatter. Why?

Hi, thanks for raising the issue. Responses to both questions are below.

There does seem to be an inconsistency. It looks like NCCL scatter support was introduced recently (https://github.com/pytorch/pytorch/pull/70029) and the support for scatter_object_list has not been updated to match. I created an issue on GitHub to track this: dist.scatter_object_list() NCCL support · Issue #84571 · pytorch/pytorch · GitHub

scatter_object_list() is a distributed collective that is not specific to any one backend, so it cannot assume which device is in use. For example, the user may keep the tensors on CPU and use the gloo backend.
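To make the device question more concrete, here is a rough, torch-free sketch of what an object scatter does conceptually: each object is pickled into a byte buffer, the buffers are padded to a common length (a tensor scatter needs equally sized chunks), and each rank decodes its own chunk using the true size. The helper names (object_to_buffer, buffer_to_object, simulate_scatter_object_list) are made up for illustration and are not the PyTorch internals. In the real collective the buffers are byte tensors, and it is those tensors that have to live on a device the backend can handle (CUDA for NCCL, CPU is fine for gloo).

```python
import pickle


def object_to_buffer(obj):
    """Pickle obj into a byte buffer; return the buffer and its true size."""
    data = pickle.dumps(obj)
    return bytearray(data), len(data)


def buffer_to_object(buf, size):
    """Decode the first `size` bytes of buf, ignoring any padding after them."""
    return pickle.loads(bytes(buf[:size]))


def simulate_scatter_object_list(objects):
    """Simulate, in one process, what each rank would receive and decode.

    Hypothetical stand-in for the real collective: there is no communication
    here, only the serialize / pad / decode steps.
    """
    buffers, sizes = zip(*(object_to_buffer(o) for o in objects))
    max_size = max(sizes)
    # Pad every buffer to max_size so all chunks have the same length.
    padded = [buf + bytearray(max_size - len(buf)) for buf in buffers]
    # "Scatter": rank i gets padded[i] together with sizes[i].
    return [buffer_to_object(padded[rank], sizes[rank])
            for rank in range(len(objects))]


received = simulate_scatter_object_list([{"rank": 0}, "hello", [1, 2, 3]])
print(received)  # [{'rank': 0}, 'hello', [1, 2, 3]]
```

Nothing in the serialize/pad/decode steps depends on the backend. Only the placement of the byte buffers does, which is why the collective has to pick the device based on the backend of the process group rather than hard-coding one.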


This was fixed for 1.12 but I forgot to update the docstring.
The fix for the collective was in: [distributed] Handle object collectives and NCCL. by kumpera · Pull Request #79034 · pytorch/pytorch · GitHub

Fixed the docstring in [c10d] Fix docstring of scatter_object_list by kumpera · Pull Request #84596 · pytorch/pytorch · GitHub
