Understanding concurrency of Gloo on Caffe2

I have a question about the concurrency of Caffe2's calls into Gloo. Caffe2’s resnet50-trainer.py example has a max_concurrent_distributed_ops setting. I raised it to a larger number, but I never see more than 12 concurrent calls to Gloo’s reduction algorithm, CudaHalvingDoubling.

I’d like to understand whether we can get faster training by pushing this concurrency higher. Is the cap of 12 due to unresolved dependencies, so that only 12 gradients are ready to be reduced at the same time? Or is there another limit to tune (perhaps a concurrency limit on the GPU)?

Please advise!