Distributed Data Parallel allreduce

Is there a way to verify whether the allreduce operation is actually being called in multi-node DDP training with the NCCL backend? In my training, the results of single-node and distributed training appear similar. @mrshenli @apaszke

One option is to run your training script under nvprof, e.g. `nvprof --print-gpu-trace python your_script.py`, and check whether NCCL kernels show up in the GPU trace (their exact names vary by NCCL version, but they contain `AllReduce`).
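
Another way to confirm the allreduce path is exercised from inside PyTorch is to register a DDP communication hook that logs each invocation. This is a minimal sketch, assuming a PyTorch version that has `register_comm_hook` and the `GradBucket.buffer()` API (roughly 1.9+):

```python
import torch
import torch.distributed as dist

def logging_allreduce_hook(state, bucket):
    # Runs once per gradient bucket; seeing these prints confirms that
    # DDP's allreduce path is actually being exercised.
    print(f"[rank {dist.get_rank()}] allreduce on bucket with "
          f"{bucket.buffer().numel()} gradient elements")
    work = dist.all_reduce(bucket.buffer(), op=dist.ReduceOp.SUM, async_op=True)
    # Average the summed gradients, matching DDP's default behavior.
    return work.get_future().then(
        lambda fut: fut.value()[0] / dist.get_world_size()
    )

# After wrapping the model (local_rank is a placeholder for your device index):
# ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
# ddp_model.register_comm_hook(state=None, hook=logging_allreduce_hook)
```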

> In my training the results of single node and distributed training appear similar.

You mean the speed is similar? What batch size is fed into each DDP instance? When using DDP, the per-process batch size should be set to original_batch_size / world_size, so that the effective global batch size (and hence the optimization behavior) matches the single-node run. See the sketch below for the usual wiring.
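
For illustration, a minimal sketch of that wiring with `DistributedSampler` (the toy dataset and `global_batch_size` here are placeholders, not from this thread):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Toy dataset so the sketch is self-contained; replace with your own.
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

global_batch_size = 256  # the batch size used in the single-node run
world_size = dist.get_world_size()  # assumes init_process_group was called
per_rank_batch_size = global_batch_size // world_size

# DistributedSampler gives each rank a disjoint shard of the dataset, so
# with batch_size = global / world_size the effective global batch size
# matches the single-node baseline.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=per_rank_batch_size, sampler=sampler)
```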

No. I have divided the batch size by the world size.
I will check out nvprof and also create a minimal working example, since I cannot share the code.
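
For reference, a minimal working example of an allreduce smoke test might look like the sketch below (an illustration, not the poster's actual code; it assumes the script is started with a launcher such as `torchrun`, which sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`):

```python
import os
import torch
import torch.distributed as dist

def main():
    # NCCL backend, as in the original question.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
    torch.cuda.set_device(local_rank)

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank contributes its own rank id; after the allreduce every
    # rank should hold the sum 0 + 1 + ... + (world_size - 1).
    t = torch.full((1,), float(rank), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    expected = world_size * (world_size - 1) / 2
    print(f"rank {rank}: got {t.item()}, expected {expected}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If every rank prints the expected sum, allreduce is working across nodes; if the values stay at each rank's own id, the ranks are not actually communicating.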