Running the imagenet example on a single-node, 4-GPU setup issues 3 NCCL AllReduce ops per mini-batch for gradient synchronization, with sizes of 2052000, 28852224, and 15853824 bytes.
I assumed that each op would respect the bucket_cap_mb limit, i.e. that no AllReduce would be larger than, say, 25 MB (the default), yet the second op is roughly 28.85 MB.
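To make the comparison concrete, here is a quick sketch of the arithmetic. It assumes that bucket_cap_mb=25 translates to 25 * 1024 * 1024 bytes; that interpretation, and the helper name `compare_to_cap`, are my own assumptions for illustration rather than anything confirmed above.

```python
# Sizes (in bytes) of the three AllReduce ops observed per mini-batch.
allreduce_sizes = [2052000, 28852224, 15853824]

# Assumption: the cap is counted in MiB, i.e. 25 * 1024 * 1024 bytes.
bucket_cap_mb = 25
cap_bytes = bucket_cap_mb * 1024 * 1024


def compare_to_cap(sizes, cap):
    """Return (size_in_bytes, size_in_mib, exceeds_cap) for each op."""
    return [(s, s / (1024 * 1024), s > cap) for s in sizes]


report = compare_to_cap(allreduce_sizes, cap_bytes)
for size, mib, over in report:
    print(f"{size:>10} bytes = {mib:6.2f} MiB, exceeds cap: {over}")
```

Under that assumption only the second op is above the cap; the other two are below it.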
Am I missing something?