Can we change the communication speed of GLOO/NCCL manually?

PyTorch supports GPU-to-GPU communication in distributed training via the Gloo and NCCL backends.

I want to analyze how distributed training behaves under high latency, low bandwidth, or an otherwise slow interconnect. So, is it possible to manually change (i.e., decrease) the communication speed between GPUs? Additionally, can the latency and bandwidth of GPU-to-GPU communication be set manually?

Thank you.

Hey @Yuki_Takezawa, AFAIK there are no user-facing APIs to directly change Gloo/NCCL speed. If you would like to mimic the behavior of a slower interconnect, could this be done at a higher layer? E.g., could you wrap the collective APIs to manually insert a delay?


Thank you for your suggestion.

Yes, we can insert a delay (e.g., time.sleep(1.0)) before dist.send() to simulate a slow network, along the lines of the sketch below. However, I want to simulate a slow communication network in a more realistic setting.
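For concreteness, a minimal sketch of that workaround, assuming each message pays a fixed latency plus a size-proportional transfer time; the `LATENCY_S` and `BANDWIDTH_BYTES_PER_S` knobs are made up for illustration and are not PyTorch settings:

```python
# Minimal sketch (not an official PyTorch API): monkey-patch dist.send so
# every message pays a fixed latency plus a size-proportional transfer time.
import time

import torch.distributed as dist

LATENCY_S = 0.05              # emulated one-way latency in seconds (assumption)
BANDWIDTH_BYTES_PER_S = 10e6  # emulated link bandwidth in bytes/s (assumption)

_orig_send = dist.send

def slow_send(tensor, dst, *args, **kwargs):
    # Charge the latency plus the time this tensor would take to transfer
    # at the emulated bandwidth, then perform the real send.
    nbytes = tensor.element_size() * tensor.nelement()
    time.sleep(LATENCY_S + nbytes / BANDWIDTH_BYTES_PER_S)
    return _orig_send(tensor, dst, *args, **kwargs)

dist.send = slow_send  # subsequent dist.send() calls now go through the wrapper
```

The same wrapping idea applies to collectives such as dist.all_reduce().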

For instance, if we run PyTorch distributed training across multiple virtual machines, we can set the communication speed manually with the tc command (a sketch is given below). However, since that environment is complicated to set up, I am looking for an easier way.
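For reference, the kind of tc setup I mean, run on each VM (the interface name eth0 and the numbers are assumptions):

```
# add 100 ms of latency and cap bandwidth at 10 Mbit/s (illustrative values)
sudo tc qdisc add dev eth0 root netem delay 100ms rate 10mbit

# inspect the current setting, and remove it when done
tc qdisc show dev eth0
sudo tc qdisc del dev eth0 root
```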

Good point, Yuki. I actually have the same need, to simulate a wireless network. I wonder if you have settled on a solution?

Thanks

I found that using Docker is a relatively easy way.
The network between Docker containers (e.g., its latency and bandwidth) can also be shaped with the tc command, as sketched below.
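A hedged sketch of that setup (the image name, interface, and numbers are placeholders; the container needs the NET_ADMIN capability to modify its own network settings):

```
# start each worker with permission to change its own network stack
docker run --cap-add=NET_ADMIN -it --name worker0 my_pytorch_image
docker run --cap-add=NET_ADMIN -it --name worker1 my_pytorch_image

# inside each container, shape the virtual interface just as on a VM
tc qdisc add dev eth0 root netem delay 100ms rate 10mbit
```

Compared to full virtual machines, the containers share the host kernel, so this environment is much quicker to set up and tear down.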