Reducing communication cost of synchronization step

Hi all,

I am implementing a new distributed optimizer using PyTorch.
The optimizer converges well, but it requires synchronizing the model parameters every now and then.
Currently, this synchronization step takes so much time that it cancels out the benefit of parallelizing in the first place.
At the moment I use a very straightforward approach: for `corr` (a list of all trainable model parameters), I simply do:

    for c in corr:
        dist.all_reduce(c, op=dist.ReduceOp.SUM)

Is it possible to reduce the communication cost here, for example by batching all the parameters into a single collective call (see the sketch below)?
I have also seen that there are DDP communication hooks, but they seem to operate only on a GradBucket of gradients. Can they be used in other scenarios as well?
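
For reference, one idea I have been toying with is to flatten all the parameters into one contiguous buffer, run a single all_reduce on it, and then copy the results back, instead of issuing one call per tensor. Below is a rough sketch of what I mean; it relies on the private torch._utils flatten/unflatten helpers, and sync_params is just a placeholder name, so please treat it as an illustration rather than something I am sure is the right approach:

    import torch
    import torch.distributed as dist
    from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

    def sync_params(corr):
        # Placeholder helper: pack all parameter tensors into one contiguous
        # buffer, run a single all_reduce, then scatter the results back.
        flat = _flatten_dense_tensors([c.data for c in corr])
        dist.all_reduce(flat, op=dist.ReduceOp.SUM)
        for c, synced in zip(corr, _unflatten_dense_tensors(flat, [c.data for c in corr])):
            c.data.copy_(synced)

Even if that works, it feels a bit hacky since those helpers are internal, which is why I was hoping the communication hook machinery (or something like it) could be reused for this.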

Many thanks,
Best regards,
Alena