Hi, I was reading the torch.distributed docs, and I found that they say
scatter_object_list does not support the NCCL backend because tensor-based scatter is not supported. But
dist.scatter seems to support the NCCL backend. I think these statements conflict.
ref: Distributed communication package - torch.distributed — PyTorch 1.12 documentation
Could anyone explain why?
Thanks a lot!
I read the code, and it seems that
scatter_object_list does not move
tensor_list to the device before scattering. Why?
Hi, thanks for raising the issue. Responses to the questions are below.
There does seem to be an inconsistency. It looks like NCCL scatter support was introduced recently (Implement scatter primitive for ProcessGroupNCCL by wanchaol · Pull Request #70029 · pytorch/pytorch · GitHub), and support in scatter_object_list has not been updated accordingly. I created an issue on GitHub to track this: dist.scatter_object_list() NCCL support · Issue #84571 · pytorch/pytorch · GitHub
scatter_object_list() is a distributed collective that is not specific to a particular backend. We cannot assume which device is used; for example, the user may have the tensors on CPU and use the
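To illustrate the backend-agnostic, CPU-tensor case, here is a minimal single-process sketch (world_size=1, Gloo backend) of scatter_object_list; the address/port values are placeholders for this example:

```python
import os
import torch
import torch.distributed as dist

# Single-process "group" on the Gloo backend, which operates on CPU tensors.
# scatter_object_list itself does not assume any particular device.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder rendezvous address
os.environ.setdefault("MASTER_PORT", "29500")      # placeholder port
dist.init_process_group("gloo", rank=0, world_size=1)

output = [None]  # one output slot per rank, on every rank
objects = [{"msg": "hello", "tensor": torch.ones(2)}]  # input list, needed on src only
dist.scatter_object_list(output, objects, src=0)
print(output[0]["msg"])

dist.destroy_process_group()
```

With more than one rank, each rank would receive `objects[rank]` into `output[0]`; the objects are pickled and moved as CPU byte tensors under Gloo, which is why no explicit device placement appears in the call.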