Combining model parallelism with streams for an embarrassingly parallelizable model

I’m currently trying to implement the model in this paper https://arxiv.org/abs/1607.06854, which has 350 units that can be evaluated completely independently of one another during a single forward step. I’d like to distribute the computation of the units across multiple GPUs (I have two) and, assuming it helps performance, across multiple streams as well.

I found this, which gives me a good idea of how to split the model across GPUs, but I can’t find any good examples of how to use streams in PyTorch, and I don’t understand the documentation. Is using the torch.cuda.Stream class to get additional concurrency feasible, or am I completely wrong about this? I have already implemented the unit as an nn.Module, let’s call it PVM_unit, and I can add multiple instances of it to my model with setattr, but how can I run them independently as fast as possible?
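For concreteness, here is a rough sketch of what I’m imagining. The body of PVM_unit below is just a placeholder (the real one implements the unit from the paper); the units are assigned round-robin to the available GPUs, each with its own stream. Is this roughly the intended way to use torch.cuda.Stream?

```python
import torch
import torch.nn as nn

# Placeholder for the real PVM_unit described above; only the structure
# of the example matters, not this module's internals.
class PVM_unit(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.tanh(self.fc(x))

num_units = 350
devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]

# Assign each unit to a GPU round-robin, and give it its own stream there.
unit_devices = [devices[i % len(devices)] for i in range(num_units)]
units = [PVM_unit().to(d) for d in unit_devices]
streams = [torch.cuda.Stream(device=d) for d in unit_devices]

# Dummy inputs, one per unit, already on the unit's device.
inputs = [torch.randn(8, 16, device=d) for d in unit_devices]
outputs = [None] * num_units

# Make sure the inputs (created on the default streams) are ready
# before the side streams start consuming them.
for d in devices:
    torch.cuda.synchronize(d)

for i in range(num_units):
    # Kernels launched inside this context are queued on stream i instead
    # of the default stream, so independent units can overlap on one GPU.
    with torch.cuda.stream(streams[i]):
        outputs[i] = units[i](inputs[i])

# Wait for all streams to finish before using the results.
for s in streams:
    s.synchronize()
```

I don’t know whether 350 separate streams is reasonable or whether I should group several units per stream, so that part is a guess too.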

Thanks ahead of time.