Running the imagenet example on a single-node, 4-GPU setup issues 3 NCCL AllReduce ops per mini-batch for gradient synchronization, with sizes
I assumed that each op would respect the bucket_cap_mb limit, i.e. that no AllReduce would be larger than, say, 25 MB (the default).
Am I missing something?
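
To make the assumption concrete, here is a rough sketch of how I'm reasoning about the sizes. It is plain Python, not DDP's actual bucketing code. It assumes fp32 gradients (4 bytes per element) and a hypothetical greedy rule that closes a bucket once its running size reaches the cap. The helper names, layer names, and element counts are all made up for illustration.

```python
MB = 1024 * 1024  # bytes per MB, as used for the cap in this sketch


def grad_size_mb(numel, bytes_per_elem=4):
    """Size of one parameter's gradient in MB (fp32 assumed by default)."""
    return numel * bytes_per_elem / MB


def greedy_buckets(param_numels, bucket_cap_mb=25.0, bytes_per_elem=4):
    """Hypothetical grouping rule (an assumption, not DDP's real code):
    append gradients in order and close the bucket once it reaches the cap.
    A gradient is never split, so a bucket can end up larger than the cap."""
    buckets, names, size_mb = [], [], 0.0
    for name, numel in param_numels:
        names.append(name)
        size_mb += grad_size_mb(numel, bytes_per_elem)
        if size_mb >= bucket_cap_mb:
            buckets.append((names, size_mb))
            names, size_mb = [], 0.0
    if names:  # leftover gradients form a final, smaller bucket
        buckets.append((names, size_mb))
    return buckets


# Made-up parameter list: (name, number of elements)
params = [
    ("fc.weight", 8_000_000),    # about 30.5 MB on its own, above the cap
    ("conv.weight", 2_000_000),
    ("bn.weight", 512),
]

buckets = greedy_buckets(params, bucket_cap_mb=25.0)
for names, size_mb in buckets:
    print(f"{size_mb:8.2f} MB  {names}")
```

Under this rule a single large gradient already produces an AllReduce bigger than the cap, since gradients are not split across buckets. Whether DDP behaves the same way is exactly what I'm unsure about.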