I have a question about the concurrency of Caffe2 calling into Gloo. I see the max_concurrent_distributed_ops flag in Caffe2's resnet50-trainer.py example. I tuned it up to a larger value, but the number of concurrent calls into Gloo's reduction algorithm (CudaHalvingDoubling) tops out at 12.
I'd like to understand whether turning this value up further can speed up training. Is the cap at 12 due to unresolved dependencies, i.e. only 12 gradients are ready to be reduced at the same time? Or is there another limit to tweak (maybe the GPU itself has a concurrency limit)?
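To make sure I understand the mechanism, here is a toy sketch (all names are mine, not Caffe2's or Gloo's) of the two limits I think are at play: a semaphore-style cap like max_concurrent_distributed_ops, versus gradients only becoming "ready" one at a time as backprop walks the layers. If readiness is the bottleneck, raising the cap shouldn't change the observed concurrency:

```python
import threading
import time

MAX_CONCURRENT_OPS = 16   # stand-in for max_concurrent_distributed_ops
NUM_GRADIENTS = 50

sem = threading.Semaphore(MAX_CONCURRENT_OPS)
lock = threading.Lock()
in_flight = 0
peak = 0

def allreduce(ready_event):
    """Toy collective op: waits until its gradient is ready, then runs under the cap."""
    global in_flight, peak
    ready_event.wait()        # gradient unavailable until backprop reaches its layer
    with sem:                 # the concurrency cap
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)      # stand-in for the actual reduction work
        with lock:
            in_flight -= 1

events = [threading.Event() for _ in range(NUM_GRADIENTS)]
threads = [threading.Thread(target=allreduce, args=(e,)) for e in events]
for t in threads:
    t.start()

# Backprop produces gradients sequentially, staggered in time.
for e in events:
    e.set()
    time.sleep(0.002)

for t in threads:
    t.join()

print(peak)   # bounded by the semaphore, but usually limited by readiness instead
```

In this sketch the peak concurrency stays below the cap because gradients trickle in slower than reductions complete — which is my guess for why I see at most 12 concurrent CudaHalvingDoubling calls regardless of the flag. Happy to be corrected if the real limit is elsewhere.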
Please advise!
Thanks!