Optimizers and multiprocessing: should I manually average gradients?

I have a similar question. I also opened an issue at pytorch/fairseq#779, and the response there was that averaging is built in.

How about running some black-box experiments to find out?
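
One way such an experiment could be set up: give each worker a batch that produces a known, distinct gradient, let the framework synchronize, then compare the gradient you actually observe against the sum and the mean of the per-worker values. Below is a rough sketch of the comparison step only. `classify_reduction` is a hypothetical helper, not part of PyTorch or fairseq, and the numbers are made up for illustration. In a real run the per-worker values would be whatever you read from `param.grad` on each process.

```python
import math

def classify_reduction(per_worker_grads, observed_grad, tol=1e-6):
    """Guess how per-worker gradients were combined into the observed one.

    Hypothetical helper for a black-box test: it does not talk to any
    framework, it only compares numbers you collected yourself.
    """
    total = sum(per_worker_grads)
    mean = total / len(per_worker_grads)
    matches_sum = math.isclose(observed_grad, total, abs_tol=tol)
    matches_mean = math.isclose(observed_grad, mean, abs_tol=tol)
    if matches_sum and matches_mean:
        # Happens with a single worker or gradients that sum to zero.
        return "ambiguous"
    if matches_mean:
        return "mean"
    if matches_sum:
        return "sum"
    return "neither"

# Made-up example: two workers whose local gradients are 1.0 and 3.0.
local_grads = [1.0, 3.0]
print(classify_reduction(local_grads, 2.0))  # observed equals the mean
print(classify_reduction(local_grads, 4.0))  # observed equals the sum
print(classify_reduction(local_grads, 1.0))  # e.g. no synchronization at all
```

Pick inputs whose gradients do not sum to zero, otherwise sum and mean are indistinguishable and the helper reports "ambiguous". If the result is "mean", manual averaging on top would shrink the gradient a second time.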