If I use standard nn.DataParallel, I get the following errors:
Traceback (most recent call last):
[...]
File "/home/***/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DataParallel' object has no attribute 'dim_z'
that is, errors about attributes that I have added to my network but that standard nn.DataParallel isn’t “aware of” (attribute lookups stop at the wrapper instead of reaching the wrapped module).
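For context, DataParallelPassthrough is the usual subclass that forwards unknown attribute lookups to the wrapped module; a minimal sketch (my actual version may differ slightly):

```python
import torch.nn as nn

class DataParallelPassthrough(nn.DataParallel):
    """nn.DataParallel that falls back to the wrapped module's attributes
    (e.g. a custom self.dim_z) instead of raising AttributeError."""

    def __getattr__(self, name):
        try:
            # resolve standard nn.DataParallel / nn.Module attributes first
            return super().__getattr__(name)
        except AttributeError:
            # fall through to custom attributes on the wrapped network
            return getattr(self.module, name)
```

With this wrapper, `wrapped.dim_z` resolves to the underlying network's attribute rather than raising the AttributeError above.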
DataParallelPassthrough used to work just fine, but after upgrading to Python 3.10 (3.10.2) and PyTorch 1.11 (1.11.0+cu102), I get the following error when I actually use data parallelism:
Traceback (most recent call last):
[...]
File "/home/***/lib/python3.10/site-packages/torch/cuda/nccl.py", line 51, in _check_sequence_type
if not isinstance(inputs, collections.Container) or isinstance(inputs, torch.Tensor):
AttributeError: module 'collections' has no attribute 'Container'
If I don’t use DataParallelPassthrough (i.e., if the batch size allows running on a single GPU), everything is fine.
Any ideas on how I could fix this? Any insight on a better way to parallelize my (modified) network?
Based on this comment it seems to be related to Python 3.10.
Could you downgrade your Python version and check if 3.9 would work (in a new virtual environment)?
@ptrblck Thanks for your comment, I was aware of it being Python3.10-related but I thought I should ask here in case there are any insights on how to solve this, or even whether there’s a “better” way to parallelize my model.
Indeed, with python 3.9 I had no problems (not tested with python 3.9 AND PyTorch 1.11 though).
That’s strange, since the failing line of code in the current master is using:
def _check_sequence_type(inputs: Union[torch.Tensor, Sequence[torch.Tensor]]) -> None:
if not isinstance(inputs, collections.abc.Container) or isinstance(inputs, torch.Tensor):
raise TypeError("Inputs should be a collection of tensors")
now and doesn’t fit your error message anymore:
if not isinstance(inputs, collections.Container) or isinstance(inputs, torch.Tensor):
File "/home/***/lib/python3.10/site-packages/torch/cuda/nccl.py", line 51, in _check_sequence_type
if not isinstance(inputs, collections.Container) or isinstance(inputs, torch.Tensor):
AttributeError: module 'collections' has no attribute 'Container'
Apparently I had a different PyTorch version before. I think you linked a fork (by carmocca) earlier, but this seems to be the case in the master branch of the original repo as well.
Apologies if this is a stupid question, but shouldn’t I get this version (i.e., the master branch) by installing PyTorch simply with a torch entry in the requirements.txt file in a venv? I think I’ve missed something trivial here…
I just checked again (after uninstalling torch and reinstalling as above) and the condition is still isinstance(inputs, collections.Container).
root@4031504dc1c7:/workspace# python -c "import torch; print(torch.__path__)"
['/opt/conda/lib/python3.8/site-packages/torch']
root@4031504dc1c7:/workspace# sed -n 51p /opt/conda/lib/python3.8/site-packages/torch/cuda/nccl.py
if not isinstance(inputs, collections.abc.Container) or isinstance(inputs, torch.Tensor):
@nullgeppetto I think many people are running into this today because Python 3.10 and PyTorch 1.11 are both very recent releases!
Older torch builds still use collections.Container in torch/cuda/nccl.py, which Python 3.10 no longer provides, so collections.abc.Container should be used instead of collections.Container.
There is no need to change your PyTorch version!
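You can verify this on any interpreter (the version check is only there so the snippet also runs on older Pythons):

```python
import sys
import collections
import collections.abc

# The ABC itself has always lived in collections.abc.
assert isinstance([1, 2, 3], collections.abc.Container)

# The top-level alias collections.Container was deprecated since
# Python 3.3 and removed in Python 3.10, which is exactly what the
# old nccl.py code trips over.
if sys.version_info >= (3, 10):
    assert not hasattr(collections, "Container")
```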
I’m glad it worked out ^^
So, you need to edit lib/python3.10/site-packages/torch/cuda/nccl.py so that the check at line 51 uses collections.abc.Container:
def _check_sequence_type(inputs: Union[torch.Tensor, Sequence[torch.Tensor]]) -> None:
if not isinstance(inputs, collections.abc.Container) or isinstance(inputs, torch.Tensor):
raise TypeError("Inputs should be a collection of tensors")
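Alternatively, if you’d rather not edit files under site-packages, you could restore the removed alias at startup, before the code path that reaches torch.cuda.nccl runs (a workaround sketch, not an official fix):

```python
# Restore the alias that Python 3.10 removed, so the old
# `isinstance(inputs, collections.Container)` check in affected
# PyTorch builds keeps working. Run this before torch.cuda.nccl
# is exercised.
import collections
import collections.abc

if not hasattr(collections, "Container"):
    collections.Container = collections.abc.Container
```

On Python ≤ 3.9 the `if` is simply a no-op, since the alias still exists there.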