Can we measure how much time a GPU waited during all-reduce?

Hi there. Since all-reduce is a synchronous operation, if the GPUs differ in compute or I/O capability, the faster GPUs have to wait for the slower ones. Is there a way to know how much time each GPU waited during an all-reduce operation?
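
To make the setup concrete, here is roughly what I mean, as a minimal sketch (assuming a single-node `torchrun` launch with the NCCL backend; the `time.sleep` just simulates one rank being slower than the others):

```python
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")          # torchrun supplies rank/world size via env vars
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
rank = dist.get_rank()

time.sleep(rank * 0.5)                   # simulate ranks finishing their compute at different times

grad = torch.ones(1 << 20, device="cuda")
t0 = time.perf_counter()
dist.all_reduce(grad)                    # the collective cannot complete until every rank has joined
torch.cuda.synchronize()                 # wait for the NCCL kernel to finish before reading the timer
print(f"rank {rank}: all_reduce took {time.perf_counter() - t0:.3f} s")

dist.destroy_process_group()
```

The fastest rank reports the largest time here, because most of what it measures is waiting for the slowest rank, and that is exactly the part I would like to isolate.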

@MaCasK While you wait for someone to answer your specific query, you can take a look at this post by @ptrblck.

Thanks very much, but if I’m not misunderstanding, Nsight only gives results externally? Is there some way to get the result within the training script, so training can be adjusted according to it?
Also, to my knowledge Nsight can monitor GPU usage and I/O usage, but not for a particular process. The post you mentioned seems not so different from using time.time() or something similar; it doesn’t separate an all-reduce call into waiting and transferring. So I’m afraid this can’t solve my problem.
Anyway, thanks for your answer.
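
For what it’s worth, here is a rough sketch of the kind of split I mean (my own approximation, assuming the NCCL backend, where `dist.barrier()` blocks the host until all ranks have arrived; `timed_all_reduce` is just an illustrative helper name):

```python
import time
import torch
import torch.distributed as dist

def timed_all_reduce(tensor: torch.Tensor):
    """Illustrative helper: returns (wait_seconds, transfer_seconds)."""
    torch.cuda.synchronize()     # make sure this rank's own compute is finished
    t0 = time.perf_counter()
    dist.barrier()               # fast ranks sit here until the slowest rank arrives
    t1 = time.perf_counter()
    dist.all_reduce(tensor)      # with the skew absorbed by the barrier, this is mostly transfer
    torch.cuda.synchronize()     # wait for the NCCL kernel to finish
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1
```

The extra barrier and synchronize calls add overhead, and this only works for collectives launched explicitly; DDP’s own gradient all-reduce happens inside `backward()`, so the split doesn’t apply there directly.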

@MaCasK Yup, I know I did not answer your question :grinning:

Are you using multiple GPUs in your home system or somewhere else? I am planning to buy another GPU for my personal use, hence the question.

Nsight Systems will profile all processes in a PyTorch DDP run and will show the NCCL reductions. From the timeline view you can then reason about the IDLE/wait time of different processes.
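
To make those regions easier to find on the timeline, one option (a minimal sketch, assuming a standard DDP training loop; the function, range names, and batch keys are illustrative) is to mark the phases of each step with NVTX ranges, which Nsight Systems displays as named bars when the script is launched under something like `nsys profile -t cuda,nvtx ...`:

```python
import torch

def training_step(model, batch, loss_fn, optimizer):
    torch.cuda.nvtx.range_push("forward")
    loss = loss_fn(model(batch["x"]), batch["y"])
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward_and_allreduce")
    loss.backward()                  # DDP launches its NCCL all-reduces during backward
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer_step")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()
```

Gaps between the NCCL kernels of different ranks inside the "backward_and_allreduce" range then give a visual estimate of how long each rank was waiting.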

Yes, I’m using multiple GPUs.

So do I need to reason about it manually, or is there some way to do it automatically during training?