Discrepancy between manual parameter registration vs using nn.ModuleList when parallelizing

In an nn.Module's initializer, I usually get the same behavior whether I append submodules to an nn.ModuleList or append them to a regular Python list while registering them manually with self.add_module.
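For example (a quick sanity check — the class names here are made up just for illustration), both styles expose the same parameters through the nn.Module machinery:

```python
import torch.nn as nn

class AutoReg(nn.Module):
    """Registers submodules automatically via nn.ModuleList."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(10, 10, bias=False) for _ in range(3))

class ManualReg(nn.Module):
    """Registers submodules manually via add_module; also keeps a plain list."""
    def __init__(self):
        super().__init__()
        self.layers = []
        for i in range(3):
            layer = nn.Linear(10, 10, bias=False)
            self.layers.append(layer)
            self.add_module(f'layer_{i}', layer)

# Both register three weight matrices (only the parameter names differ).
print(len(list(AutoReg().parameters())), len(list(ManualReg().parameters())))  # prints: 3 3
```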

However, I notice a discrepancy when I try to run experiments on multiple GPUs. The following code results in RuntimeError: Expected all tensors to be on the same device.

import torch
import torch.nn as nn

class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = []

        for i in range(3):
            layer = nn.Linear(10, 10, bias=False)
            self.add_module(f'layer_{i}', layer)
            self.layers.append(layer)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = MyNetwork()
model = model.to(torch.device('cuda'))

p_model = nn.DataParallel(model, device_ids=['cuda:0', 'cuda:1'])
p_model(torch.zeros((5, 10)))

When I change self.layers = [] to self.layers = nn.ModuleList() and delete the self.add_module(f'layer_{i}', layer) call (nn.ModuleList registers each appended module automatically), the code works fine.
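For concreteness, the working version just described would look like this:

```python
import torch
import torch.nn as nn

class MyNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each appended layer automatically,
        # and the list attribute itself is a registered submodule.
        self.layers = nn.ModuleList()
        for i in range(3):
            self.layers.append(nn.Linear(10, 10, bias=False))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```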

Why does DataParallel recognize the submodules differently in these two scenarios?

I don’t know what DataParallel is using internally, but note that it’s deprecated and replaced by DistributedDataParallel for performance reasons and more even memory usage.

I understand DistributedDataParallel is better than DataParallel, but I’m asking more about the implicit contract: automatic parameter registration (i.e. via self.__setattr__) should always behave the same as manual parameter registration (i.e. via self.add_module).

That is, how is it possible for any wrapper to know the difference between manual parameter registration and automatic parameter registration? I feel like some implicit contract is broken to observe them behaving differently.

The lesson I just learned as a user is that automatic registration is not the same as manual registration, and I now have evidence to always prefer the former. Since I have an instance of the latter failing, I’ll probably never use manual registration again, and I am confused why it is not deprecated. So, to state my question differently: am I drawing the wrong lesson here?

I just found this comment, which argues this is “expected” behavior, but a contract breach should not be expected behavior. If I understand the comment correctly, this happens because DataParallel uses the module’s attributes from its __dict__ instead of referring to the module’s registered parameters directly. Perhaps I should file this as a GitHub issue.
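That split is easy to see on the CPU. The sketch below only imitates what replication does (it is not DataParallel’s actual code): the copy gets fresh registered children in _modules, while the shallow-copied __dict__ still holds the original plain list, so a forward pass over that list would use the original modules:

```python
import copy
import torch.nn as nn

class Manual(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = []  # plain list: lives only in __dict__, invisible to nn.Module
        for i in range(3):
            layer = nn.Linear(10, 10, bias=False)
            self.add_module(f'layer_{i}', layer)
            self.layers.append(layer)

m = Manual()

# Roughly simulate replication: shallow-copy the module, then replace the
# registered children in _modules with fresh copies. The plain list in the
# shallow-copied __dict__ still points at the ORIGINAL layers.
r = copy.copy(m)
r._modules = {k: copy.deepcopy(v) for k, v in m._modules.items()}

print(r.layer_0 is m.layer_0)    # False: the registered child was replaced
print(r.layers[0] is m.layer_0)  # True: the plain list still holds the original
```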

The issue I’m seeing is that DataParallel is deprecated. So even if you claim a contract is broken, I doubt DataParallel will receive any fixes.

Is it correct to say that DataParallel is deprecated? I interpreted the docs as just giving a recommendation to use DistributedDataParallel. Also, that recommendation has existed in the docs for at least 3 years, so I would think DataParallel would have been removed by now if it were deprecated.

I know why this recommendation exists. DataParallel uses multithreading, which means it is somewhat stymied by Python’s Global Interpreter Lock, whereas DistributedDataParallel uses multiprocessing to get around this. But I never notice much of a performance hit.

Are there any docs somewhere that I am missing that say it is on its way out?

In addition to the GIL issues, your workload would also suffer from imbalanced GPU memory usage (the default device would have higher usage) and extra communication overhead, since the model is copied to all devices in each forward pass, which is not the case for DDP.

Yes, based on e.g. this comment:

No new features will be added to DataParallel, it’s in maintenance mode #65936. Wrappers make launching distributed jobs really easy, give them a try.


Thanks! Appreciate the additional detail about the communication overhead.

On the broader point about the contract: should automatic parameter registration always behave the same as manual parameter registration (i.e. any time they differ, it is because of a bug)?

Yes, my expectation would be that both APIs yield the same results, and if a supported module depends on one particular way of registering a parameter, I would see that as a bug.
