Simulating federated learning on a single machine with multiple GPUs

Shayan_Talaei · July 31, 2022, 3:07pm

Hi,
I’m trying to implement federated learning, with a server and 100 clients, on a machine with 8 GPUs. Each client has its own private model and private dataset. In the parallel setup, each client takes some local steps, and after some time the server asks a group of clients to send it their models.
However, I’m implementing a simulator in which all these actions happen sequentially: the server chooses a group of clients and makes each of them take its local steps one by one. This takes a long time, and I want to do this part in parallel. That is, I want 8 clients to take their steps simultaneously, each on a separate GPU. With this, I hope to get an 8x speedup.

mrshenli · August 1, 2022, 3:50pm

Hey @Shayan_Talaei, are you asking about the recommended ways to parallelize your system?

which means the server chooses a group of clients and makes each of them take their local steps one by one. This takes so long and I want to do this part in parallel.

It looks like CPU launch overhead is the bottleneck, so the GPUs are not fully utilized. You can verify whether this is indeed the source of the slowness using the profiler: PyTorch Profiler (PyTorch Tutorials 2.1.1+cu121 documentation)
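As a rough illustration of the profiling step, here is a minimal sketch using `torch.profiler`. The model, batch, and the `client_local_step` label are placeholders and not taken from the original simulator; the idea is only to wrap a few local steps and compare CPU time against GPU time in the report.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, record_function

# Placeholder client model and batch (assumptions for illustration only).
model = nn.Linear(128, 10)
batch = torch.randn(32, 128)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, batch = model.cuda(), batch.cuda()
    activities.append(ProfilerActivity.CUDA)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

with profile(activities=activities) as prof:
    for _ in range(5):
        # Label the region so it is easy to find in the report.
        with record_function("client_local_step"):
            optimizer.zero_grad()
            loss = model(batch).sum()
            loss.backward()
            optimizer.step()

# Summarize the recorded operators, sorted by total CPU time.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```

If the CPU time of the labeled region is much larger than the time the GPU kernels take, that would be consistent with launch overhead being the bottleneck.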

To parallelize it, you can try multi-threading or multi-processing. PyTorch releases the Python GIL when entering the C++ side, so multi-threading should help to some degree. This is also how DataParallel is implemented:

github.com

pytorch/pytorch/blob/d08157d5168015c3de0a6d182d08a77a38e5c207/torch/nn/parallel/parallel_apply.py#L23


      
        for result in map(get_a_var, obj):
            if isinstance(result, torch.Tensor):
                return result
    if isinstance(obj, dict):
        for result in map(get_a_var, obj.items()):
            if isinstance(result, torch.Tensor):
                return result
    return None


def parallel_apply(modules, inputs, kwargs_tup=None, devices=None):
    r"""Applies each `module` in :attr:`modules` in parallel on arguments
    contained in :attr:`inputs` (positional) and :attr:`kwargs_tup` (keyword)
    on each of :attr:`devices`.

    Args:
        modules (Module): modules to be parallelized
        inputs (tensor): inputs to the modules
        devices (list of int or torch.device): CUDA devices

    :attr:`modules`, :attr:`inputs`, :attr:`kwargs_tup` (if given), and
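The multi-threading suggestion above could look roughly like the following sketch for one federated round. It is not the simulator from this thread: `available_devices`, `local_steps`, `train_shard`, `run_round`, and the tiny `nn.Linear(4, 1)` client model are all hypothetical names chosen for illustration. Each worker thread owns one device and trains its share of the selected clients on it, falling back to a single CPU device when no GPU is visible.

```python
from concurrent.futures import ThreadPoolExecutor

import torch
import torch.nn as nn


def available_devices():
    """One device per visible GPU, or a single CPU device as a fallback."""
    n = torch.cuda.device_count()
    return [torch.device(f"cuda:{i}") for i in range(n)] or [torch.device("cpu")]


def local_steps(global_state, data, target, device, steps=5, lr=0.1):
    """Hypothetical client update: a few SGD steps on a private model copy."""
    model = nn.Linear(4, 1)  # placeholder architecture
    model.load_state_dict(global_state)
    model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    data, target = data.to(device), target.to(device)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(data), target)
        loss.backward()
        opt.step()
    # Return weights on the CPU so the server can aggregate them.
    return {k: v.detach().cpu() for k, v in model.state_dict().items()}


def train_shard(global_state, shard, device):
    """Train every client in this shard sequentially on one device."""
    return [local_steps(global_state, x, y, device) for x, y in shard]


def run_round(global_state, client_data, devices):
    """Split the selected clients across devices, one worker thread each."""
    n = len(devices)
    shards = [client_data[i::n] for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = pool.map(train_shard, [global_state] * n, shards, devices)
        return [state for shard_states in results for state in shard_states]
```

With 8 GPUs and 8 selected clients, each thread would train exactly one client. The returned states are grouped by device rather than in the original client order, which does not matter for plain averaging.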

If you want to get rid of the GIL completely, you will need multi-processing: a launcher to start 8 processes, plus tools to communicate across processes.

Shayan_Talaei · August 2, 2022, 12:36am

Thank you so much @mrshenli. I implemented it with multi-threading and it runs faster now. At some points, all the GPUs show non-zero utilization, which means they are running in parallel. However, during server averaging all of them show zero utilization, which I guess is inevitable.

mrshenli · August 5, 2022, 3:22am

However, during the server averaging times all of them have zero utilization which is inevitable I guess.

Yep, that looks right to me with synchronous training. Keeping the GPUs busy during averaging would need some asynchrony.
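One modest way to shrink the idle window, sketched below under assumptions, is to fold each client's weights into a running mean as soon as that client finishes, so most of the server's averaging overlaps with the clients that are still training. This is not fully asynchronous federated learning, since the round still ends only when every client has reported. `fake_client_update` and `streaming_average` are hypothetical names, and the client update is a stand-in for real local training.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import torch


def fake_client_update(client_id):
    """Hypothetical stand-in for local training; returns a client state dict."""
    return {"w": torch.full((3,), float(client_id))}


def streaming_average(client_ids, max_workers=8):
    """Fold each client's state into a running mean as soon as it is ready."""
    mean, count = None, 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fake_client_update, cid) for cid in client_ids]
        for fut in as_completed(futures):
            state = fut.result()
            count += 1
            if mean is None:
                mean = {k: v.clone().float() for k, v in state.items()}
            else:
                for k, v in state.items():
                    # Incremental mean: mean += (x - mean) / n
                    mean[k] += (v.float() - mean[k]) / count
    return mean
```

The result equals the ordinary average over all clients up to floating-point rounding, so this changes only when the averaging work happens, not what is computed.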

hongquan · May 17, 2023, 2:03am

Hi, I am running into the same problem now. How did you figure this out? Could you please share the code?