ThreadPoolExecutor GPU utilization is 100%

Hello,

I want to train several clients (models) in parallel. Here is my sequential training of the clients:

```python
for idx in idxs_users:
    clients[idx].set_state_dict(copy.deepcopy(w_glob))
    loss = clients[idx].train(is_print=False)
```

Now I have changed my code to run the for-loop in parallel using `concurrent.futures.ThreadPoolExecutor`. Here is my code:


```python
import concurrent.futures
import copy
import os

import torch

jobs, results = [], []
#max_len = sum(len(user) for user in idxs_users) - 4
torch.set_num_threads(1)
torch.set_num_interop_threads(1)
max_len = 10
streams = [torch.cuda.Stream() for _ in range(max_len)]

#torch.cuda.synchronize()
# Execute tasks with limited concurrency
with concurrent.futures.ThreadPoolExecutor(max_workers=min(max_len, os.cpu_count() - 1)) as executor:
    for i, (cn, idx) in enumerate([(cn, idx) for cn in range(len(idxs_users)) for idx in idxs_users[cn]]):
        stream = streams[i % max_len]  # Select stream
        #torch.cuda.synchronize()
        with torch.cuda.stream(stream):
            jobs.append(executor.submit(train_client, clients[cn][idx], copy.deepcopy(w_glob[cn])))

concurrent.futures.wait(jobs)
for stream in streams:
    with torch.cuda.stream(stream):
        torch.cuda.synchronize()

loss_locals = [job.result() for job in jobs]
```

However, my code gets stuck here for hours. GPU utilization is at 100% and CPU usage is also very high. My guess is that all tasks try to access the GPU at the same time, and this contention causes the hang. Could you please help? Do you have any suggestions on how to solve this issue, or on how to parallelize the client training without the for-loop?
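To test my guess that simultaneous GPU access is the problem, one pattern I'm considering is serializing the GPU-touching section with a lock. Here is a torch-free sketch of that pattern (the `gpu_lock`, the counter, and the dummy workload are all placeholders I made up to illustrate the idea, not my real training code):

```python
import threading
import concurrent.futures

gpu_lock = threading.Lock()  # placeholder guard for the GPU-touching section
counter = {"in_critical": 0, "max_in_critical": 0}

def train_client(idx):
    # Only one thread at a time may enter the guarded section
    with gpu_lock:
        counter["in_critical"] += 1
        counter["max_in_critical"] = max(counter["max_in_critical"],
                                         counter["in_critical"])
        result = idx * 2  # stand-in for the real training step
        counter["in_critical"] -= 1
    return result

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(train_client, range(8)))
```

Of course this serializes the GPU work again, so it would only confirm the diagnosis rather than give a speedup.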
Thank you.