PyTorch supports GPU-to-GPU communication in a distributed training environment via the Gloo and NCCL backends.
I want to analyze how distributed training behaves under high latency, low bandwidth, or generally slow communication. So, is it possible to manually decrease the communication speed between GPUs? Additionally, can we manually change the latency and bandwidth of GPU-to-GPU communication?
Hey @Yuki_Takezawa, AFAIK, there are no user APIs to directly change Gloo/NCCL speed. If you would like to mimic the behavior of a slower interconnect, could this be done at a higher layer? E.g., could you wrap the collective APIs to manually insert a delay?
Yes, we can insert a delay (e.g., time.sleep(1.0)) before dist.send() to simulate a slow network. However, I want to simulate slow communication in a more realistic environment.
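The wrapping idea above can be sketched as a small decorator. This is a minimal illustration, not a PyTorch API; the delay value and which collectives to patch are assumptions you would tune for your experiment:

```python
import time
import functools

def with_delay(fn, delay_s=0.05):
    """Wrap a communication call so a fixed artificial latency is added
    before the real call runs. Purely a simulation aid."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        time.sleep(delay_s)  # simulated one-way latency
        return fn(*args, **kwargs)
    return wrapper

# In a real training script you would patch the collectives you use, e.g.:
#   import torch.distributed as dist
#   dist.send       = with_delay(dist.send,       delay_s=0.01)
#   dist.all_reduce = with_delay(dist.all_reduce, delay_s=0.01)
```

Note that this only adds a constant per-call latency; it does not model bandwidth limits or congestion, which is exactly the limitation raised here.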
For instance, if we run PyTorch distributed training across multiple virtual machines, we can manually set the communication speed with the tc command. However, since setting up that environment is complicated, I am looking for an easier way.
I found that using Docker is a relatively easy way.
The network between Docker containers (e.g., its latency and bandwidth) can also be configured with the tc command.
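A possible recipe for this, assuming the container image includes iproute2 and the in-container interface is eth0 (the image name `my-pytorch-image` and network name `simnet` are placeholders):

```shell
# Two containers on a user-defined bridge network; NET_ADMIN is
# required so tc can modify queueing disciplines inside the container.
docker network create simnet
docker run -d --name worker0 --network simnet --cap-add NET_ADMIN my-pytorch-image
docker run -d --name worker1 --network simnet --cap-add NET_ADMIN my-pytorch-image

# Add 50 ms latency and cap bandwidth at 100 Mbit/s via tc/netem:
docker exec worker0 tc qdisc add dev eth0 root netem delay 50ms rate 100mbit
docker exec worker1 tc qdisc add dev eth0 root netem delay 50ms rate 100mbit

# Remove the shaping when the experiment is done:
docker exec worker0 tc qdisc del dev eth0 root
docker exec worker1 tc qdisc del dev eth0 root
```

Unlike a time.sleep() wrapper, netem shapes the actual traffic, so both latency and bandwidth limits apply to the real Gloo/NCCL transfers.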