ray.get() is too slow

I’m working in a federated learning setting, so I want to parallelize my code across users with Ray. When I use ray.get(), calling each user's training inside the get, training is much slower than running the same code sequentially.
I use 3 GPUs.
Any ideas?
Thanks!