Parallel Training of Multiple Models

I am trying to train N independent models on M GPUs in parallel on a single machine. What I want to achieve is: train the N models, M at a time in parallel, for a given number of epochs; store the intermediate output returned by each model until all have finished; process the stored outputs; and repeat for a number of rounds.

Each client has a device property holding a GPU id, and the model parameters are moved to that device before training. The device_dict dictionary has one key per GPU, containing the list of client ids assigned to that device. Below is a sketch of how it is built, followed by the training loop I have implemented so far (untested); I am unsure whether this is the best way of doing it.
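
For concreteness, device_dict is built along these lines (a simplified sketch; build_device_dict and the round-robin assignment stand in for my actual setup):

# Assign client ids to GPUs round-robin: each key is a GPU id, each
# value is the list of client ids that train on that device.
def build_device_dict(num_clients, num_devices):
    device_dict = {device: [] for device in range(num_devices)}
    for client_id in range(num_clients):
        device_dict[client_id % num_devices].append(client_id)
    return device_dict

# e.g. build_device_dict(5, 2) -> {0: [0, 2, 4], 1: [1, 3]}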

import math
import torch.multiprocessing as mp  # assuming PyTorch: a drop-in replacement for multiprocessing that supports CUDA tensors

def train_mp(self, num_rounds, train_epochs):

    # Shared queue the workers push their logits onto; a plain
    # queue.Queue would not cross process boundaries
    logit_queue = mp.Queue()

    for _ in range(num_rounds):
        self.round += 1

        diffusion_seed = self.server.generate_seed()
        server_logit = self.server.get_logit()

        # Run the clients in waves of at most one process per GPU, so
        # only num_devices models train at any one time
        for i in range(math.ceil(self.num_clients / self.num_devices)):
            processes = []
            for device, client_ids in self.device_dict.items():
                if i < len(client_ids):
                    process = mp.Process(
                        target=self.client_update,
                        args=(self.clients[client_ids[i]], server_logit,
                              diffusion_seed, train_epochs, logit_queue),
                    )
                    process.start()
                    processes.append(process)

            # Wait for the current wave before starting the next one
            # (note: workers putting large objects on the queue may not
            # join until those items are consumed)
            for process in processes:
                process.join()

        # Update the server model with the logits collected this round
        self.server.knowledge_distillation(logit_queue)
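
For completeness, client_update is intended to look roughly like this (a sketch only; Client.train and Client.get_logit are placeholders for my actual client API):

def client_update(self, client, server_logit, diffusion_seed, train_epochs, logit_queue):
    # Runs inside a worker process: move the client's model to its
    # assigned GPU, train it locally for train_epochs epochs, and push
    # the resulting logits onto the shared queue for the server.
    client.model.to(client.device)
    client.train(server_logit, diffusion_seed, train_epochs)
    logit_queue.put(client.get_logit())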

I currently do not have access to a multi-GPU machine to test any of this, so I am unsure whether this is the best approach. One thing I have picked up from the docs is that CUDA cannot be re-initialized in a forked subprocess, so (assuming PyTorch) I plan to set the spawn start method in the entry point, along these lines (trainer stands in for whatever object owns train_mp):
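import torch.multiprocessing as mp

if __name__ == '__main__':
    # CUDA does not work with the default 'fork' start method on Linux;
    # 'spawn' gives each worker a fresh interpreter instead.
    mp.set_start_method('spawn')
    trainer.train_mp(num_rounds=10, train_epochs=5)

Any help would be appreciated.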