What's the recommended way of profiling DistributedDataParallel PyTorch code?

To see how much time is spent on inter-process communication (allreduce, etc.), as well as static memory usage and the amount of data transferred between processes.

nvprof from NVIDIA is probably the best tool available; see "CUDA Pro Tip: nvprof is Your Handy Universal GPU Profiler" on the NVIDIA Developer Blog.
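One pattern that works well with nvprof is to skip the warm-up iterations and capture only a few steady-state steps via `torch.cuda.profiler`. A minimal sketch, assuming a torchrun-style launch; the script name, model, and step counts are illustrative assumptions, not anything from the thread (the nvprof flags themselves are real options):

```python
# Minimal sketch of a DDP step loop instrumented for nvprof capture.
# Assumed launch command (ddp_trace/train_ddp.py are made-up names):
#   nvprof --profile-from-start off --profile-child-processes \
#          -o ddp_trace.%p.nvvp \
#          torchrun --nproc_per_node=2 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl")

    model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(30):
        if step == 10:
            torch.cuda.profiler.start()   # begin nvprof capture after warm-up
        x = torch.randn(64, 1024, device="cuda")
        loss = model(x).sum()             # placeholder forward/loss
        loss.backward()                   # DDP allreduces gradients here
        opt.step()
        opt.zero_grad()
        if step == 20:
            torch.cuda.profiler.stop()    # capture 10 steady-state steps

if __name__ == "__main__":
    main()
```

With `--profile-child-processes`, nvprof writes one trace per worker process (`%p` expands to the PID), so you get a separate timeline for each rank, including NCCL kernels and memcpy sizes.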

DDP currently does not work with the autograd profiler.
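Since the autograd profiler won't capture DDP's communication, a rough alternative for the "time spent on allreduce" part of the question is to time a standalone `all_reduce` of a comparably sized tensor with CUDA events. A hedged sketch, again assuming a torchrun-style launch; the tensor size is an arbitrary assumption:

```python
# Manually timing one all_reduce as a rough proxy for DDP's gradient
# communication cost. This is not DDP's internal bucketing, just an estimate.
import os
import torch
import torch.distributed as dist

def time_allreduce(numel=25 * 1024 * 1024):  # ~100 MB of float32, arbitrary
    rank = int(os.environ["RANK"])           # set by torchrun
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group("nccl")

    tensor = torch.randn(numel, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    dist.all_reduce(tensor)        # warm-up, excludes NCCL setup cost
    torch.cuda.synchronize()

    start.record()
    dist.all_reduce(tensor)
    end.record()
    torch.cuda.synchronize()       # wait for the collective to finish

    if rank == 0:
        print(f"all_reduce of {tensor.numel() * 4 / 1e6:.0f} MB "
              f"took {start.elapsed_time(end):.2f} ms")
    dist.destroy_process_group()

if __name__ == "__main__":
    time_allreduce()
```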