I am using DDP for multi-host training. I want to investigate the collective communication process, and I am wondering whether any statistics or logs are available.
For example, it would be helpful to know when and how many times the allreduce operation is called in each training epoch, as well as the distribution of job sizes. Is there any way to get this information?