How to debug PyTorch distributed?

It can be tricky to use the Python debugger in a multi-rank setup. The first thing you'd notice if you try is that pdb may crash your program when used from inside an mpirun or torchrun launcher. Fortunately, this is fixable, and you can use pdb almost like usual.

There is a catch: it's not easy to attach the debugger to every rank, but it's pretty easy to attach it to just one particular rank (and let all the other ranks pause).

This PR from @ezyang adds a new helper called torch.distributed.breakpoint. It can be used more or less like Python's breakpoint() statement, except you're supposed to call it on all ranks (and always pass the same rank argument, so that across all ranks exactly one rank is the one that listens for debugger input). A minimal usage sketch is below.
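Here's a minimal sketch of what that looks like under torchrun. The details (backend choice, the exact keyword rank=0, launching with two processes) are assumptions for illustration; check the signature against the PyTorch version you have installed.

```python
# Launch with, e.g.: torchrun --nproc_per_node=2 debug_example.py
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()

    x = torch.ones(1) * rank

    # Every rank must call this. Only the chosen rank (0 here) drops into an
    # interactive pdb session; the other ranks wait until it continues
    # (type "c" at the pdb prompt to resume all ranks).
    torch.distributed.breakpoint(rank=0)

    dist.all_reduce(x)
    print(f"rank {rank}: {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Because the non-debugged ranks block on a collective while rank 0 is at the prompt, the whole job stays in sync and you can step through rank 0's state without the others racing ahead.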
