torch.distributed: logging of collective calls

vadimkantorov · February 16, 2023, 9:31am

Is there any way to enable logging of collective calls?

This would be useful for debugging deadlocks.

I could monkey-patch all the torch.distributed methods to print the rank, the process group and the collective being called, but I wonder whether there is already an existing option or environment variable for this.

Also, would such Python-level monkey-patching catch collective calls made at the C++ level? Probably not. Is there a more reliable way, or are there any hooks for this?
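A minimal sketch of what such monkey-patching could look like. The helper name `patch_collectives` and the `COLLECTIVE_NAMES` list are made up for illustration and are not part of torch.distributed. The sketch assumes the rank is available in the `RANK` environment variable (as set by torchrun) and that the process group is passed as the `group=` keyword argument.

```python
import functools
import os

# Hypothetical, incomplete list of collectives to wrap; extend as needed.
COLLECTIVE_NAMES = ("all_reduce", "broadcast", "all_gather", "reduce_scatter", "barrier")


def patch_collectives(module, names=COLLECTIVE_NAMES, log=print):
    """Replace each named function on `module` with a wrapper that logs the call first."""

    def make_wrapper(fn, fn_name):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Assumption: RANK is set by the launcher; "?" is printed otherwise.
            rank = os.environ.get("RANK", "?")
            # Only a group passed by keyword is visible here.
            group = kwargs.get("group")
            log(f"[rank {rank}] {fn_name} group={group}")
            return fn(*args, **kwargs)

        wrapper._logged = True  # marker to avoid wrapping the same function twice
        return wrapper

    for name in names:
        original = getattr(module, name, None)
        if original is None or getattr(original, "_logged", False):
            continue
        setattr(module, name, make_wrapper(original, name))


# Intended usage (requires PyTorch, so it is left commented out):
# import torch.distributed as dist
# patch_collectives(dist)
```

A wrapper like this only sees calls that go through the patched Python attributes. Names imported earlier with `from torch.distributed import all_reduce` keep pointing at the unpatched function, and collectives issued directly from C++ would presumably not be logged at all.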

There is the torch.distributed documentation page ("Distributed communication package - torch.distributed"), but from that page it is unclear whether any level of TORCH_DISTRIBUTED_DEBUG enables simple tracing of collective calls.
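For reference, a sketch of how the existing debug variables can be switched on from Python before the process group is created. The function name `enable_distributed_debug_env` is hypothetical. `TORCH_DISTRIBUTED_DEBUG` (OFF, INFO, DETAIL) and `TORCH_CPP_LOG_LEVEL` are PyTorch variables, while `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` belong to NCCL and only matter with the NCCL backend. Whether the resulting output amounts to a per-call trace of collectives is exactly the open question, so this is a starting point for experimenting rather than a confirmed answer.

```python
import os


def enable_distributed_debug_env():
    """Set debug-related environment variables for the current process.

    Assumption: this runs before torch.distributed.init_process_group,
    since the variables are typically read at initialization time.
    """
    settings = {
        "TORCH_DISTRIBUTED_DEBUG": "DETAIL",  # most verbose PyTorch level
        "TORCH_CPP_LOG_LEVEL": "INFO",        # let C++-side INFO logs through
        "NCCL_DEBUG": "INFO",                 # NCCL's own logging
        "NCCL_DEBUG_SUBSYS": "COLL",          # restrict NCCL logs to collectives
    }
    os.environ.update(settings)
    return settings


enable_distributed_debug_env()
```

The same variables can equally be exported in the shell before launching the job, which avoids any ordering concerns inside the script.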