Parallelizing different models with shared replay buffer

I have 5 different models. I want them to train asynchronously with a shared replay buffer (the buffer only needs to be synced every 1000 iterations, as new samples are added).
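
For concreteness, this is roughly what I mean by "synced every 1000 iterations": each model can train from a stale local snapshot of the buffer and only refresh it periodically. Untested sketch; `ReplaySnapshot` is a made-up name for illustration, not something from my code:

```python
import random
import torch

class ReplaySnapshot:
    """Samples from a local snapshot of a shared buffer, refreshed periodically."""

    def __init__(self, shared_samples, sync_every=1000):
        self.shared_samples = shared_samples   # the shared buffer (a list of tensors)
        self.sync_every = sync_every
        self.local = list(shared_samples)      # stale local copy
        self.step = 0

    def sample(self, batch_size):
        # Refresh the local copy only every `sync_every` calls.
        if self.step % self.sync_every == 0:
            self.local = list(self.shared_samples)
        self.step += 1
        return torch.stack(random.sample(self.local, batch_size))
```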

How do I do this?

While I understand that PyTorch launches CUDA operations asynchronously, I believe each call to .backward() acts as a synchronization point. Therefore the naive solution of keeping a list of models = […] and looping over it with zero_grad(), backward(), and step() (as in the sketch below) will not actually parallelize the training.
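
This is the naive version I have in mind, with toy stand-ins (nn.Linear models and random tensors in place of real replay-buffer samples):

```python
import torch
import torch.nn as nn

# Naive sequential version: all five updates are issued from one Python loop,
# so each model's zero_grad / backward / step runs one after the other.
models = [nn.Linear(8, 1) for _ in range(5)]
optimizers = [torch.optim.Adam(m.parameters(), lr=1e-3) for m in models]

for it in range(1000):
    batch = torch.randn(32, 8)    # pretend this came from the replay buffer
    target = torch.randn(32, 1)
    for model, opt in zip(models, optimizers):
        loss = nn.functional.mse_loss(model(batch), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
```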