How to run a function only in the slowest process in torch.distributed?

Suppose I have 8 GPUs, each running some operations that produce different outputs. After aggregating, I would like to update a variable self.x with the aggregated result. From what I understand, if I write self.x = aggregate, all 8 processes will each run this line once, so it runs 8 times in total. Is that right? If so, how do I make it run only in the slowest process, so that it runs only once?

*It probably doesn’t affect the result, but it just feels “ugly” to me. I am not sure if it is normal for this kind of code to be written when writing distributed code. Hopefully someone can advise.

You can synchronize the processes and then perform the variable update on your desired rank process.
Then you will need to distribute the aggregated value to the other ranks.

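import torch
import torch.distributed as dist

# Assumes dist.init_process_group() has already been called
# and that this runs inside a method (hence self).
RANK = 0  # your target rank
dist.barrier()  # synchronize: every rank waits here until the slowest one arrives
if dist.get_rank() == RANK:
    self.x = aggregate()  # executed on the target rank only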

How do I distribute it to the other ranks? I presume all ranks will see the same self.x? Or will each process duplicate self and keep its own version of its variables?

The processes are independent of each other and share nothing unless they communicate, so each one keeps its own copy of self.x. To make self.x hold the same value on every rank, you have to synchronize it explicitly.
You can refer to the collective functions in the PyTorch distributed docs.

You may choose to broadcast or all_reduce, etc.
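
For illustration, here is a minimal sketch of both options. It assumes the process group is already initialized (for example by launching with torchrun and calling dist.init_process_group) and that x has the same shape and dtype on every rank. The function names update_on_rank_and_broadcast and aggregate_on_all_ranks are made up for this example; only the dist.* calls are PyTorch API.

```python
import torch
import torch.distributed as dist


def update_on_rank_and_broadcast(x, new_value, src=0):
    """Overwrite x on rank `src` only, then copy it to every other rank.

    `new_value` is only read on rank `src`; other ranks may pass None.
    """
    dist.barrier()  # wait until every rank (including the slowest) gets here
    if dist.get_rank() == src:
        x.copy_(new_value)  # the update itself runs on one rank only
    dist.broadcast(x, src=src)  # now every rank holds rank src's x
    return x


def aggregate_on_all_ranks(local):
    """Alternative: sum each rank's local tensor so all ranks get the total."""
    total = local.clone()  # all_reduce works in place, so keep `local` intact
    dist.all_reduce(total, op=dist.ReduceOp.SUM)
    return total
```

With broadcast, one rank computes the value and the others receive it. With all_reduce there is no special rank: every process ends up with the same aggregated tensor, so each can simply assign it to its own self.x.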

Ok. A side question. Do you know if all_gather gathers tensors in the same order every time? I need to all_gather 2 tensors, and they have to be gathered in the same order so that item i in the first list can be matched to item i in the second list.

I haven’t used it but I expect so.
You can try it out and check. :)
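
A quick way to check is sketched below. As far as I know, all_gather fills slot i of the output list with the tensor from rank i, so two separate calls should line up index by index. The sketch assumes an initialized process group and tensors of identical shape and dtype on every rank; gathered_rank_order and gather_pair are hypothetical helper names.

```python
import torch
import torch.distributed as dist


def gathered_rank_order():
    """Gather each rank's own rank id and return the order they arrive in."""
    world_size = dist.get_world_size()
    mine = torch.tensor([dist.get_rank()])
    slots = [torch.empty_like(mine) for _ in range(world_size)]
    dist.all_gather(slots, mine)
    return [int(t.item()) for t in slots]


def gather_pair(a, b):
    """all_gather two tensors; item i of each list should come from rank i."""
    world_size = dist.get_world_size()
    a_list = [torch.empty_like(a) for _ in range(world_size)]
    b_list = [torch.empty_like(b) for _ in range(world_size)]
    dist.all_gather(a_list, a)
    dist.all_gather(b_list, b)
    return a_list, b_list
```

If gathered_rank_order() returns list(range(world_size)) on every rank, then a_list[i] and b_list[i] from gather_pair both come from rank i and can be matched safely.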