PyTorch supports GPU-to-GPU communication in a distributed training environment via the Gloo and NCCL backends.
I want to analyze how distributed training behaves under high latency, low bandwidth, or generally slow communication. So, is it possible to manually decrease the communication speed between GPUs? Additionally, can we manually change the latency and bandwidth of GPU-to-GPU communication?
Hey @Yuki_Takezawa, AFAIK, there are no user APIs to directly change Gloo/NCCL speed. If you would like to mimic the behavior of a slower interconnect, could this be done at a higher layer? E.g., could you wrap the collective APIs to manually insert a delay?
Yes, we can insert a delay (e.g., time.sleep(1.0)) before dist.send() to simulate a slow network. However, I want to simulate slow communication in a more realistic environment.
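The wrapping idea above can be sketched as a small decorator. This is a minimal illustration, not a PyTorch API; the delay value and which collectives to patch are assumptions you would tune for your experiment:

```python
import time
import functools

def with_delay(fn, delay_s=0.05):
    """Wrap a communication call so a fixed artificial latency is added
    before the real call runs. Purely a simulation aid."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        time.sleep(delay_s)  # simulated one-way latency
        return fn(*args, **kwargs)
    return wrapper

# In a real training script you would patch the collectives you use, e.g.:
#   import torch.distributed as dist
#   dist.send       = with_delay(dist.send,       delay_s=0.01)
#   dist.all_reduce = with_delay(dist.all_reduce, delay_s=0.01)
```

Note that this only adds a constant per-call latency; it does not model bandwidth limits or congestion, which is exactly the limitation raised here.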
For instance, if we run PyTorch distributed training across multiple virtual machines, we can manually set the communication speed with the tc command. However, since setting up that environment is complicated, I am looking for an easier way.
I found that using Docker is a relatively easy way.
The network between Docker containers (e.g., its latency and bandwidth) can also be configured with the tc command.
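A possible recipe for this, assuming the container image includes iproute2 and the in-container interface is eth0 (the image name `my-pytorch-image` and network name `simnet` are placeholders):

```shell
# Two containers on a user-defined bridge network; NET_ADMIN is
# required so tc can modify queueing disciplines inside the container.
docker network create simnet
docker run -d --name worker0 --network simnet --cap-add NET_ADMIN my-pytorch-image
docker run -d --name worker1 --network simnet --cap-add NET_ADMIN my-pytorch-image

# Add 50 ms latency and cap bandwidth at 100 Mbit/s via tc/netem:
docker exec worker0 tc qdisc add dev eth0 root netem delay 50ms rate 100mbit
docker exec worker1 tc qdisc add dev eth0 root netem delay 50ms rate 100mbit

# Remove the shaping when the experiment is done:
docker exec worker0 tc qdisc del dev eth0 root
docker exec worker1 tc qdisc del dev eth0 root
```

Unlike a time.sleep() wrapper, netem shapes the actual traffic, so both latency and bandwidth limits apply to the real Gloo/NCCL transfers.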