How does work with multiple CPU threads?

If I have N CPU threads that are sampling from a simulator (such as in deep reinforcement learning),
I frequently will need to move the model to a GPU for backprop.

If multiple threads send the model to the same GPU asychronously, is there a deadlock problem? How do we set this up efficiently? E.g., perhaps it is optimal to send X models to the GPU at once, but no more, or else the whole system will slow down.