Simulating federated learning on a single machine with multiple GPUs

Hi,
I’m trying to implement federated learning, with a server and 100 clients, on a machine with 8 GPUs. Each client has its own private model and its own private dataset. In the parallel setup, each client takes some local steps, and after some time the server asks a group of clients to send their models to the server.
However, I’m implementing a simulator, so all of these actions happen sequentially: the server chooses a group of clients and makes each of them take their local steps one by one. This takes a long time, and I want to do this part in parallel, i.e. have 8 clients take their steps simultaneously, each on a different GPU. With this, I hope to get an 8x speedup.

Hey @Shayan_Talaei, are you asking about the recommended ways to parallelize your system?

the server chooses a group of clients and makes each of them take their local steps one by one. This takes a long time, and I want to do this part in parallel.

This looks like CPU launching overhead is the bottleneck, so the GPU capacity is not fully utilized. You can verify whether this is indeed the source of the slowness using the profiler: PyTorch Profiler — PyTorch Tutorials 2.1.1+cu121 documentation

To parallelize it, you can try multi-threading or multi-processing. PyTorch does release the Python GIL when it enters the C++ side, so multi-threading should help to some degree. This is also how DataParallel is implemented.
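
For concreteness, here is a minimal multi-threading sketch. It assumes each client object exposes a `model` and a `local_step(device)` method; those names are hypothetical placeholders, not from the original post. Each selected client is pinned to one of the 8 GPUs and trained in its own thread, and since the GIL is released inside the CUDA/C++ kernels, the threads can overlap:

```python
import threading
import torch

def run_client(client, device, num_local_steps):
    # Runs one client's local training entirely on `device`.
    client.model.to(device)
    for _ in range(num_local_steps):
        client.local_step(device)  # hypothetical: forward/backward/optimizer step

def run_round(selected_clients, num_local_steps, num_gpus=8):
    # Launch one thread per selected client, round-robin over the GPUs.
    threads = []
    for i, client in enumerate(selected_clients):
        device = torch.device(f"cuda:{i % num_gpus}")
        t = threading.Thread(target=run_client, args=(client, device, num_local_steps))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()  # wait for every client before the server averages
```

In practice you would probably cap the number of concurrently running threads at the number of GPUs, e.g. by processing the selected clients in chunks of 8.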

If you want to completely get rid of the GIL, you will need multi-processing: a launcher to launch 8 processes, and tools to communicate across the processes (see the sketch after the list below).

  1. Launcher: torchrun (Elastic Launch) — PyTorch 2.1 documentation
  2. Comm: Getting Started with Distributed RPC Framework — PyTorch Tutorials 2.1.1+cu121 documentation
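
A rough, self-contained sketch of the multi-processing idea is below. To keep it short it uses `mp.spawn` and an NCCL process group rather than torchrun/RPC from the links above, and the toy linear model and training loop are placeholders for your client logic:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # One process per GPU; the process group lets the workers average
    # their models with an all-reduce (a stand-in for server averaging).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    device = torch.device(f"cuda:{rank}")

    # Toy client model and local steps; replace with real client logic.
    model = torch.nn.Linear(10, 1).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        x = torch.randn(32, 10, device=device)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # "Server averaging": sum each parameter across processes, then divide.
    for p in model.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
        p.data /= world_size

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

With torchrun you would drop the `mp.spawn` call and read the rank and world size from the environment variables it sets; with RPC, the averaging would be expressed as remote calls to a server process instead of an all-reduce.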

Thank you so much @mrshenli. I implemented it via multi-threading and it runs faster now. At some points, all the GPUs have non-zero utilization, which means they are running in parallel. However, during the server averaging steps all of them have zero utilization, which I guess is inevitable.

However, during the server averaging steps all of them have zero utilization, which I guess is inevitable.

Yep, that looks right to me for synchronous training. To keep the GPUs busy during that phase, you would need some asynchrony.
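
For reference, the averaging barrier being discussed usually looks something like the sketch below (FedAvg-style, assuming each client hands back its `state_dict`; this is illustrative, not the actual code from this thread). While this loop runs on the CPU, no kernels are queued on any GPU, hence the zero utilization:

```python
import torch

def average_state_dicts(state_dicts):
    # Simple FedAvg-style averaging on the server.
    # Nothing here touches the GPUs, so they sit idle until the averaged
    # model is sent back to the clients for the next round.
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack(
            [sd[key].float().cpu() for sd in state_dicts]
        ).mean(dim=0)
    return avg
```

To shrink the idle gap, the averaging of one group of clients could overlap with the local training of the next group, at the cost of some staleness in the aggregated model.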

Hi, I am encountering the same problem now. How did you figure this out? Could you please share the code?