@nathan you may need to spawn two threads, one per GPU, because although CUDA calls are asynchronous by default, your module might introduce a Python-level synchronization point.
Could you post some sample code for your submodule?
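For reference, here is a minimal sketch of the one-thread-per-GPU pattern, assuming two independent models each trained on its own device (falls back to CPU when two GPUs aren't available, so it runs anywhere; the model, data, and training loop are placeholders):

```python
import threading

import torch
import torch.nn as nn

def train_on_device(model, data, target, device, steps=5):
    # Each thread pins its model and tensors to one device, so any
    # Python-level sync point (e.g. loss.item(), backward) only blocks
    # this thread, not the other GPU's work.
    model.to(device)
    data, target = data.to(device), target.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(data), target)
        loss.backward()
        opt.step()

# Use two GPUs if present; otherwise fall back to CPU for illustration.
devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]
models = [nn.Linear(8, 1) for _ in devices]
data = torch.randn(32, 8)
target = torch.randn(32, 1)

threads = [
    threading.Thread(target=train_on_device, args=(m, data, target, d))
    for m, d in zip(models, devices)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Since CUDA kernel launches are async, the GIL is released while each GPU is busy, so the two training loops can overlap as long as neither thread forces a sync the other has to wait on.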
Hey nathan, I was thinking of doing the same thing: I have 2 models I wanted to optimize in parallel using 2 GPUs. However, I see you said it takes longer. Was this a bug, or is it really the case? If so I won't bother with it, unless it was easy to implement and try?