Running multiple modules in a ModuleList on different GPUs in parallel?

Is there a way to run the modules in a ModuleList in parallel on multiple GPUs? Their inputs are shaped differently and they don’t depend on one another.

Yes, if you are pushing different layers and their inputs to different GPUs, their execution will be asynchronous.

Oh, but they’re in a for loop. Will Python automatically proceed to the next iteration of the for loop before each GPU finishes?

Yes, CUDA kernels are launched by the CPU and are then executed asynchronously, which means the CPU can run ahead and execute other work, such as launching another CUDA kernel on another device.
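A minimal sketch of that pattern (the module sizes and input shapes here are made up for illustration): each iteration moves one module and its input to its own device and launches the forward pass; because kernel launches are asynchronous, the CPU moves on to the next iteration while the previous GPU is still computing. The CPU fallback is only there so the snippet also runs on machines with fewer GPUs.

```python
import torch
import torch.nn as nn

# Hypothetical setup: independent modules with differently shaped inputs.
modules = nn.ModuleList([nn.Linear(10, 10), nn.Linear(20, 20)])
inputs = [torch.randn(4, 10), torch.randn(4, 20)]

num_gpus = torch.cuda.device_count()
outputs = []
for i, (module, x) in enumerate(zip(modules, inputs)):
    # One device per module; fall back to CPU if not enough GPUs are present.
    device = torch.device(f"cuda:{i}") if i < num_gpus else torch.device("cpu")
    # The launch returns control to the CPU immediately, so the loop
    # proceeds to the next module while this GPU is still working.
    outputs.append(module.to(device)(x.to(device)))

if num_gpus:
    # Block only when the results are actually needed.
    torch.cuda.synchronize()
```

Note that ops queued on the *same* device still run in order on that device's default stream; the parallelism here comes from using a different device per module.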

Okay, thanks. Hmm, would it work with DataParallel? Or do I manually have to map each iteration of the for loop to a specific GPU? All the tutorials I can find use DataParallel.

nn.DataParallel works similarly, but adds communication to this approach: it replicates a single module to each GPU, scatters the input batch, runs the replicas in parallel, and gathers the outputs. This blog post explains its mechanism.
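For contrast, a sketch of the DataParallel route (the model and batch size are placeholders): note it splits one batch across replicas of the *same* module, so it doesn't directly fit the case of distinct modules with differently shaped inputs.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
if torch.cuda.device_count() > 1:
    # Scatter along dim 0, replicate the module, run in parallel, gather --
    # the scatter/gather steps are the added communication.
    model = nn.DataParallel(model).cuda()

x = torch.randn(8, 10)
out = model(x.cuda() if torch.cuda.is_available() else x)
```

When the modules are genuinely different, the manual per-device mapping from the loop above is the simpler fit.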