New to DataParallel... can I just wrap it around each layer or module?

I have modules composed of modules composed of modules. Some of these modules might not be compatible with DataParallel, I’m not sure, because they don’t return tensors (for instance, they return distributions). I would like to enable parallelization for as much of the architecture as possible… is there any downside to this? Can I just wrap the lowest level layers and modules in nn.DataParallel? As in, can I wrap nn.Sequentials and so on in DataParallel and call it a day? I’m really not sure what the best practice is with this feature.

Yes, you can wrap submodules in nn.DataParallel. Note however that nn.DataParallel is generally slower than DistributedDataParallel (which uses a single process per device), as it has to e.g. broadcast the model parameters to all devices in each forward pass.
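To make that concrete, here is a minimal sketch of wrapping only an inner submodule rather than the whole model. The module names (`backbone`, `head`) and sizes are illustrative, not from your architecture; note that nn.DataParallel simply falls back to a plain forward when no GPU is visible, so the sketch runs on CPU too:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Wrap only the compute-heavy inner submodule in DataParallel.
        self.backbone = nn.DataParallel(nn.Sequential(
            nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8)))
        # Leave the rest of the model unwrapped.
        self.head = nn.Linear(8, 2)

    def forward(self, x):
        return self.head(self.backbone(x))

model = Net()
out = model(torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 2])
```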

Thanks! So, compared to not parallelizing at all, there wouldn’t be a downside to DataParallel, would there?

And I looked at the DistributedDataParallel docs, but it seems a little more involved than just wrapping submodules… not sure I have the time right now to go that deep.

Also… maybe I’m creating too many branches of questions, but… how does DataParallel work with modules that don’t return tensor outputs, like the one I mentioned that returns a distribution?

And once I’ve wrapped them as follows: nn.DataParallel(module), is that it? Will it parallelize when multiple GPUs are available? How will it know to use GPUs and not CPUs? I only want it to parallelize across CUDA devices.

I don’t know and would recommend checking the behavior with a minimal code snippet.
This blog post describes the internal behavior of nn.DataParallel in more detail, and you will see where the inputs are scattered/gathered etc.
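Such a minimal check could look like the sketch below (the `DistHead` module is made up for illustration). On CPU, where nn.DataParallel falls back to a plain forward, the distribution object passes through untouched; on multiple GPUs the gather step only knows how to merge tensors and nested lists/dicts of tensors, so a distribution return value would likely fail there, which is exactly what the snippet would reveal:

```python
import torch
import torch.nn as nn

class DistHead(nn.Module):
    """Illustrative module that returns a distribution, not a tensor."""
    def forward(self, x):
        return torch.distributions.Normal(x.mean(dim=1), torch.ones(x.size(0)))

wrapped = nn.DataParallel(DistHead())
dist = wrapped(torch.randn(4, 16))
print(type(dist).__name__)  # Normal (on CPU fallback)
```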

Yes, by default all visible devices are used, but you can also specify them in the initialization of nn.DataParallel.
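For example, restricting the wrapper to specific devices looks like this (the device ids `[0, 1]` are assumptions for a 2+ GPU machine; constructing the wrapper is safe even without CUDA, as it then falls back to the bare module):

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 8))
# Use only GPUs 0 and 1 and gather outputs on GPU 0.
dp = nn.DataParallel(model, device_ids=[0, 1], output_device=0)
```

Alternatively, you can limit visibility globally via the `CUDA_VISIBLE_DEVICES` environment variable before starting the script.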

Oh, even CPUs? Hmm… it would be convenient if there was a “cuda-only” option

No, all your visible GPUs will be used.

I guess my last question is, does sub-module-wise DataParallel have a cost tradeoff compared to a single DataParallel on the very-outer module?

Yes, I would think so, as the scatter/gather ops would be called for each wrapped submodule. Check the linked blog post, which describes this, and apply the same logic to each submodule to visualize how this approach would work.
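The two layouts can be sketched side by side (the blocks are illustrative). With the outer wrap there is one scatter/gather per forward pass; with per-submodule wraps, a scatter/gather would occur at every wrapped boundary on a multi-GPU machine:

```python
import torch
import torch.nn as nn

def block():
    return nn.Sequential(nn.Linear(32, 32), nn.ReLU())

# One outer wrap: a single scatter/gather per forward pass.
outer = nn.DataParallel(nn.Sequential(block(), block(), block()))

# Per-submodule wraps: scatter/gather at each of the three boundaries.
inner = nn.Sequential(*(nn.DataParallel(block()) for _ in range(3)))

x = torch.randn(4, 32)
print(outer(x).shape, inner(x).shape)
```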

So, with DistributedDataParallel, there would not be this issue? I could distribute submodule-wise, and there should not be this delay?

I don’t think DDP would simply remove this cost, since your workflow explicitly pushes modules to different devices, collects the output on a single device, executes some modules on that single device, and then repeats this pattern.
If so, then you would also need to write your DDP application in this way.
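For reference, the minimal DDP boilerplate is not huge; the sketch below shows a single-process setup on CPU with the gloo backend (the address, port, and layer sizes are placeholder assumptions, and a real multi-GPU run would spawn one process per device, e.g. via `torchrun`):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Placeholder rendezvous settings for a single local process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Wrap the model once; DDP syncs gradients during backward instead of
# broadcasting parameters in every forward pass.
model = DDP(torch.nn.Linear(16, 8))
out = model(torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 8])

dist.destroy_process_group()
```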