Parallelization of the for-loop in federated learning

I’m curious about how to parallelize the for-loop in federated learning, i.e., the loop that iterates over the selected clients to train their local models within each communication round. Most simulations loop through the selected clients serially, but I would like this for-loop to be executed in parallel.

Specifically, given N GPUs and K clients, there are three possible relationships between N and K (N == K, N < K, and N > K). Starting with the case N == K, I would like each GPU to train a model on one client’s data in parallel. To achieve this, I imagine creating N processes, each executing the local training task for one client, but I have no idea where to start, so I would appreciate some guidance. Additionally, I am wondering whether this problem could also be solved using DDP (DistributedDataParallel).
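
For the N == K case, here is a minimal sketch of the process-per-client idea, using torch.multiprocessing.spawn to pin one worker process to each GPU. The toy linear model, the FedAvg-style averaging, and the get_client_loader helper are placeholders I made up for illustration, not part of any particular framework:

```python
import copy
import torch
import torch.nn as nn
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset


def get_client_loader(client_id):
    # Hypothetical helper: in a real simulation this would return a DataLoader
    # over client `client_id`'s local partition; here it is random dummy data.
    xs = torch.randn(256, 10)
    ys = torch.randint(0, 2, (256,))
    return DataLoader(TensorDataset(xs, ys), batch_size=32, shuffle=True)


def train_one_client(rank, global_state, return_dict, local_epochs=1):
    # Each spawned process gets a rank in [0, N) and trains on GPU `rank`.
    device = torch.device(f"cuda:{rank}")
    model = nn.Linear(10, 2).to(device)   # placeholder architecture
    model.load_state_dict(global_state)   # start from the current global model
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(local_epochs):
        for x, y in get_client_loader(rank):
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    # Ship the locally trained weights back to the parent process on CPU.
    return_dict[rank] = {k: v.cpu() for k, v in model.state_dict().items()}


def run_round(num_clients):
    global_model = nn.Linear(10, 2)
    manager = mp.Manager()
    return_dict = manager.dict()

    # One process per client/GPU, all running concurrently (N == K case).
    mp.spawn(train_one_client,
             args=(global_model.state_dict(), return_dict),
             nprocs=num_clients,
             join=True)

    # FedAvg-style aggregation: average the clients' weights elementwise.
    avg_state = copy.deepcopy(global_model.state_dict())
    for k in avg_state:
        avg_state[k] = torch.stack(
            [return_dict[r][k].float() for r in range(num_clients)]).mean(dim=0)
    global_model.load_state_dict(avg_state)
    return global_model


if __name__ == "__main__":
    run_round(num_clients=torch.cuda.device_count())
```

Each spawned process owns one GPU and trains its client’s model independently; the parent process then averages the returned state dicts. For the K > N case, I imagine the same pattern would extend to a pool of N workers that each handle several clients in sequence.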

My understanding of the use case is feeding the same input <X_0, …> to N different models <M_1, …, M_N>. I’m not sure any of the existing parallelism schemes fits the bill.

For DDP, the gradient is all-reduced to locally update all replicated models.
For FSDP, the gradient is reduce-scattered to locally update the model shards.

Perhaps DDP with different initial model weights across processes is the closest fit, but I’m not sure whether that setting can lead to training convergence.
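
To make the DDP point concrete, here is a rough sketch of what “DDP with different initial weights per process” would look like. To the best of my understanding, the DDP constructor broadcasts rank 0’s module state to all ranks and all-reduces gradients on every backward pass, so the replicas would not train independently the way federated clients do; the toy model and worker function below are illustrative only:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    # One process per GPU; the usual single-node rendezvous setup.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Per-rank initialization, as in the "different initial weights" idea.
    torch.manual_seed(rank)
    model = nn.Linear(10, 2).to(f"cuda:{rank}")

    # Caveat: as far as I know, constructing DDP broadcasts rank 0's parameters
    # and buffers to all ranks, so the replicas end up identical here anyway.
    ddp_model = DDP(model, device_ids=[rank])
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # Dummy per-rank batch standing in for a client's local data.
    x = torch.randn(32, 10, device=f"cuda:{rank}")
    y = torch.randint(0, 2, (32,), device=f"cuda:{rank}")

    opt.zero_grad()
    loss_fn(ddp_model(x), y).backward()  # gradients are all-reduced across ranks here
    opt.step()                           # every rank applies the same averaged gradient

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

So if each client is supposed to take genuinely independent local steps between communication rounds, plain per-process training without gradient synchronization seems closer to the federated setting than DDP’s coupled replicas.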