I’m trying to implement federated learning, with a server and 100 clients, on a machine with 8 GPUs. Each client has its own private model and its own private dataset. In the parallel setup, each client takes some local steps, and after some time the server asks a group of clients to send their models to the server.
However, I’m implementing a simulator in which all these actions happen sequentially, which means the server chooses a group of clients and makes each of them take their local steps one by one. This takes so long and I want to do this part in parallel. That is, I want 8 clients to take their steps simultaneously, each on a different GPU. With this, I hope to get close to an 8x speedup.
Hey @Shayan_Talaei, are you asking what the recommended ways to parallelize your system are?
which means the server chooses a group of clients and makes each of them take their local steps one by one. This takes so long and I want to do this part in parallel.
It looks like the CPU launch overhead is the bottleneck, so the GPU capacity is not fully utilized. You can verify whether this is indeed the source of the slowness using the profiler: PyTorch Profiler (PyTorch Tutorials 1.12.0+cu102 documentation)
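As a rough illustration of that check, here is a minimal sketch using torch.profiler. The local_step function, the tiny Linear model, and the random data are hypothetical stand-ins for one client's local update, not code from this thread. It falls back to CPU when no GPU is visible.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical stand-in for one client's local step.
def local_step(model, batch, targets, optimizer):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(batch), targets)
    loss.backward()
    optimizer.step()
    return loss.item()

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batch = torch.randn(32, 16, device=device)
targets = torch.randn(32, 1, device=device)

# Record CPU activity always, and CUDA kernels when a GPU is present.
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    for _ in range(5):
        local_step(model, batch, targets, optimizer)

# Summary table of the most expensive operators by total CPU time.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```

If the table shows CPU time dominating while the GPU kernels are short, that would be consistent with launch overhead being the limiting factor.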
To parallelize it, you can try multi-threading or multi-processing. PyTorch does drop the Python GIL when entering the C++ side, so multi-threading should be able to help to some degree. This is also how DataParallel was implemented.
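One way the multi-threading option could look is sketched below, assuming a thread pool with one worker per visible GPU and a queue that hands out free devices. The names available_devices, local_train, and run_round are hypothetical, and the model and data are toy placeholders. With no GPU, it runs the clients one at a time on CPU.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

import torch

def available_devices():
    # One entry per visible GPU, or a single CPU device as a fallback.
    n = torch.cuda.device_count()
    return [torch.device(f"cuda:{i}") for i in range(n)] or [torch.device("cpu")]

def local_train(client_id, device, steps=3):
    # Hypothetical local update: toy model and per-client random data.
    gen = torch.Generator().manual_seed(client_id)
    x = torch.randn(64, 16, generator=gen).to(device)
    y = torch.randn(64, 1, generator=gen).to(device)
    model = torch.nn.Linear(16, 1).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(steps):
        opt.zero_grad()
        torch.nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    # Return weights on CPU so the server can average them anywhere.
    return {k: v.detach().cpu() for k, v in model.state_dict().items()}

def run_round(client_ids):
    devices = available_devices()
    free = queue.Queue()
    for d in devices:
        free.put(d)

    def worker(cid):
        device = free.get()      # borrow a free device
        try:
            return cid, local_train(cid, device)
        finally:
            free.put(device)     # hand it back for the next client

    # At most one client runs per device at any time.
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        return dict(pool.map(worker, client_ids))

results = run_round(range(16))
print(len(results))
```

The device queue keeps two clients from sharing a GPU, so with 8 GPUs up to 8 clients train at once while the rest wait their turn.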
If you want to get rid of the GIL completely, then you will need multi-processing: a launcher to start 8 processes, plus tools to communicate across processes.
Thank you so much @mrshenli. I implemented it via multi-threading and it runs faster now. At some points all the GPUs show non-zero utilization, which means they are running in parallel. However, during the server averaging times all of them have zero utilization which is inevitable I guess.
However, during the server averaging times all of them have zero utilization which is inevitable I guess.
Yep, that looks right to me with synchronous training. Keeping the GPUs busy during the averaging step would probably need some asynchrony.
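For context, the synchronous averaging step discussed above might look like the sketch below. It assumes a plain unweighted average over the clients' state_dicts (FedAvg usually weights each client by its dataset size), and average_state_dicts is a hypothetical helper, not the code used in this thread. While the server runs this, no client is training, which is why utilization drops to zero.

```python
import torch

def average_state_dicts(state_dicts):
    # Unweighted element-wise mean of each parameter across clients.
    avg = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        avg[key] = stacked.mean(dim=0)
    return avg

# Two toy "client models" with a single parameter each.
a = {"w": torch.tensor([1.0, 3.0])}
b = {"w": torch.tensor([3.0, 5.0])}
averaged = average_state_dicts([a, b])
print(averaged["w"])  # tensor([2., 4.])
```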