Where to find documentation of ProcessGroup/DeviceMesh methods?

anonymous_user_83259 · December 1, 2024, 7:37am

A seemingly simple question.

I lost some debugging time to group.allreduce(), not realizing it returns an async Work handle, unlike the module-level torch.distributed.* functions, which are synchronous by default.
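For anyone hitting the same thing, here is a minimal sketch of the difference. It assumes a single-process gloo group backed by an in-memory HashStore, purely so it runs without a launcher; the variable names are illustrative.

```python
import torch
import torch.distributed as dist

# Single-process setup for illustration only (assumption: gloo backend,
# in-memory HashStore, world_size=1). Real jobs would use torchrun or similar.
dist.init_process_group("gloo", store=dist.HashStore(), rank=0, world_size=1)
group = dist.new_group(ranks=[0])

t = torch.ones(4)

# Module-level collective: blocks by default and returns None.
sync_ret = dist.all_reduce(t, group=group)

# The same function with async_op=True returns a Work handle instead.
async_ret = dist.all_reduce(t, group=group, async_op=True)
async_ret.wait()

# Method on the ProcessGroup object: returns a Work handle directly.
work = group.allreduce([t])
work.wait()  # block until the collective has completed

dist.destroy_process_group()
```

Reading `t` before `work.wait()` is the mistake I made; the module-level call hides that step unless async_op=True is passed.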

To avoid similar mistakes, I searched for documentation of its methods, but failed to find any.

paulge · December 1, 2024, 9:10am

They should be part of the torch.distributed docs here. That page definitely contains the docs for DeviceMesh and the other distributed communication APIs.

anonymous_user_83259 · December 1, 2024, 9:24am

Are the methods of DeviceMesh listed on that page? What about ProcessGroup’s?

paulge · December 1, 2024, 10:22am

If you click on ‘[Source]’, the source code with all methods of DeviceMesh will unfold. Alternatively, have a look at the DeviceMesh class on GitHub.
The functions regarding ProcessGroup are under the ‘Collective functions’ section, or in the .pyi interface here: ProcessGroup class.
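If browsing the source is not convenient, the bindings can also be inspected from a Python session. This is only a rough sketch: it assumes a PyTorch build with distributed support, and what the docstrings contain depends on the installed version.

```python
import torch.distributed as dist

# Public attribute names exposed on the ProcessGroup binding.
methods = sorted(n for n in dir(dist.ProcessGroup) if not n.startswith("_"))
print(methods)

# Docstring attached to the binding, if any; for bound C++ methods this
# typically lists the accepted overloads and the return type.
print(dist.ProcessGroup.allreduce.__doc__)
```

The same approach works for DeviceMesh via torch.distributed.device_mesh.DeviceMesh, or with help() on either class.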

anonymous_user_83259 · December 1, 2024, 12:31pm

I check the source code of DeviceMesh often in lieu of documentation. I seek documentation.

The link to the ProcessGroup class points to the same file I linked in the root thread post. I shan’t engage more.

fduwjj · December 2, 2024, 6:44pm

cc: @irisz I think we do have documents for DeviceMesh already?