Is there any way to enable logging of collective calls?
This is useful for debugging deadlocks.
I could monkey-patch every torch.distributed method to print the rank, process group, and the collective being called, but I wonder whether an option or environment variable for this already exists.
And would such Python-level monkey-patching even catch collectives invoked at the C++ level? Probably not; is there a more reliable mechanism or hook?
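For reference, here is a minimal sketch of the Python-level monkey-patching I had in mind. The wrapper and the list of patched names are illustrative, not an existing API, and this would only trace calls that go through the Python bindings:

```python
import functools

def traced(fn, get_rank=lambda: -1):
    """Wrap a collective so each call prints the caller's rank,
    the function name, and its arguments before dispatching."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        print(f"[rank {get_rank()}] {fn.__name__} args={args} kwargs={kwargs}",
              flush=True)
        return fn(*args, **kwargs)
    return wrapper

# In a real script one would patch torch.distributed in place
# after init_process_group, e.g. (names here are illustrative):
#
#   import torch.distributed as dist
#   for name in ("all_reduce", "broadcast", "all_gather", "barrier"):
#       setattr(dist, name, traced(getattr(dist, name), dist.get_rank))
```

The `flush=True` matters for deadlock debugging: without it, the last few prints before a hang may sit in a stdout buffer and never appear.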
There’s the Distributed communication package - torch.distributed — PyTorch master documentation page, but from it it’s unclear whether any level of
`TORCH_DISTRIBUTED_DEBUG` enables even simple tracing of collective calls.