How to measure DDP time breakdown?

Hi, I am trying to use DistributedDataParallel for multi-node data parallelism.

How can I measure the time breakdown across data loading, forward, backward, and communication?

Also, for calculating FLOPs, I am going to use the GitHub repository [Calculating flops of a given pytorch model]. Does anyone know a good way to calculate FLOPs?

If the program uses GPUs, you can use CUDA events (torch.cuda.Event and its elapsed_time method) to measure the time spent on forward, backward, and the optimizer step. It is harder to separate computation from communication within the backward pass, because DDP tries to overlap the two and runs communication on dedicated CUDA streams that are not visible from the application side. In addition, communication is launched as soon as a gradient bucket is ready, so any single communication may or may not saturate the bandwidth.

To get around this, you can run the forward and backward passes locally, without DDP, and then explicitly call allreduce from the application side to synchronize gradients after the backward pass. Communication then becomes a separate step that the application can time directly.
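The approach above could be sketched roughly as follows. This is only an illustration under assumptions, not DDP's own method: PhaseTimer, timed_step, and run_demo are hypothetical names, the model and batch are toy placeholders, and gradients are allreduced one parameter at a time instead of in buckets. The measured allreduce time is therefore the non-overlapped cost and will not match what DDP achieves with overlap. The sketch falls back to a host timer and the gloo backend when no GPU is available.

```python
import os
import socket
import time

import torch
import torch.distributed as dist
import torch.nn as nn


class PhaseTimer:
    """Hypothetical helper: CUDA events on GPU, perf_counter on CPU."""

    def __init__(self, use_cuda):
        self.use_cuda = use_cuda

    def __enter__(self):
        if self.use_cuda:
            self.start = torch.cuda.Event(enable_timing=True)
            self.end = torch.cuda.Event(enable_timing=True)
            self.start.record()
        else:
            self.t0 = time.perf_counter()
        return self

    def __exit__(self, *exc):
        if self.use_cuda:
            self.end.record()
            torch.cuda.synchronize()  # both events must complete before elapsed_time
            self.ms = self.start.elapsed_time(self.end)
        else:
            self.ms = (time.perf_counter() - self.t0) * 1000.0
        return False


def timed_step(model, optimizer, loss_fn, inputs, targets, use_cuda):
    """One training step with each phase timed separately (milliseconds)."""
    times = {}
    optimizer.zero_grad()
    with PhaseTimer(use_cuda) as t:
        loss = loss_fn(model(inputs), targets)
    times["forward_ms"] = t.ms
    with PhaseTimer(use_cuda) as t:
        loss.backward()  # local only: the model is NOT wrapped in DDP
    times["backward_ms"] = t.ms
    with PhaseTimer(use_cuda) as t:
        world_size = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad.div_(world_size)  # average, as DDP does
    times["allreduce_ms"] = t.ms
    with PhaseTimer(use_cuda) as t:
        optimizer.step()
    times["optimizer_ms"] = t.ms
    return times


def run_demo(warmup_steps=2):
    # Single-process defaults so the sketch also runs without a launcher.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    if "MASTER_PORT" not in os.environ:
        with socket.socket() as s:
            s.bind(("127.0.0.1", 0))
            os.environ["MASTER_PORT"] = str(s.getsockname()[1])
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    if use_cuda:
        device = torch.device("cuda", int(os.environ.get("LOCAL_RANK", "0")))
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")

    model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
    for p in model.parameters():
        dist.broadcast(p.data, src=0)  # start all ranks from identical weights
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    inputs = torch.randn(32, 64, device=device)      # toy batch
    targets = torch.randint(0, 10, (32,), device=device)

    for _ in range(warmup_steps):  # first steps include one-time setup costs
        timed_step(model, optimizer, loss_fn, inputs, targets, use_cuda)
    times = timed_step(model, optimizer, loss_fn, inputs, targets, use_cuda)
    dist.destroy_process_group()
    return times


if __name__ == "__main__":
    print(run_demo())
```

Data loading runs on the host, so it can be timed with time.perf_counter() around the call that fetches the next batch. For a multi-process run, the same script can be started with a launcher such as torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.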