If I use standard nn.DataParallel, I get the following errors:
Traceback (most recent call last):
[...]
File "/home/***/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DataParallel' object has no attribute 'dim_z'
that is, errors about attributes that I have added to my network but that standard nn.DataParallel isn’t “aware of” (attribute lookups stop at the wrapper instead of reaching the wrapped module).
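For context, DataParallelPassthrough is the usual subclass that forwards unknown attribute lookups to the wrapped module; a minimal sketch (my actual version may differ slightly):

```python
import torch.nn as nn

class DataParallelPassthrough(nn.DataParallel):
    """nn.DataParallel that falls back to the wrapped module's attributes
    (e.g. a custom self.dim_z) instead of raising AttributeError."""

    def __getattr__(self, name):
        try:
            # resolve standard nn.DataParallel / nn.Module attributes first
            return super().__getattr__(name)
        except AttributeError:
            # fall through to custom attributes on the wrapped network
            return getattr(self.module, name)
```

With this wrapper, `wrapped.dim_z` resolves to the underlying network's attribute rather than raising the AttributeError above.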
DataParallelPassthrough used to work just fine, but after upgrading to Python 3.10 (3.10.2) and PyTorch 1.11 (1.11.0+cu102), I get the following error when I actually use data parallelism:
Traceback (most recent call last):
[...]
File "/home/***/lib/python3.10/site-packages/torch/cuda/nccl.py", line 51, in _check_sequence_type
if not isinstance(inputs, collections.Container) or isinstance(inputs, torch.Tensor):
AttributeError: module 'collections' has no attribute 'Container'
If I don’t use DataParallelPassthrough (i.e., if the batch size allows running on a single GPU), everything is fine.
Any ideas on how I could fix this? Any insight on a better way to parallelize my (modified) network?
Based on this comment it seems to be related to Python 3.10.
Could you downgrade your Python version and check if 3.9 would work (in a new virtual environment)?
@ptrblck Thanks for your comment, I was aware of it being Python3.10-related but I thought I should ask here in case there are any insights on how to solve this, or even whether there’s a “better” way to parallelize my model.
Indeed, with python 3.9 I had no problems (not tested with python 3.9 AND PyTorch 1.11 though).
That’s strange, since the failing line of code in the current master is using:
def _check_sequence_type(inputs: Union[torch.Tensor, Sequence[torch.Tensor]]) -> None:
if not isinstance(inputs, collections.abc.Container) or isinstance(inputs, torch.Tensor):
raise TypeError("Inputs should be a collection of tensors")
now and doesn’t fit your error message anymore:
if not isinstance(inputs, collections.Container) or isinstance(inputs, torch.Tensor):
File "/home/***/lib/python3.10/site-packages/torch/cuda/nccl.py", line 51, in _check_sequence_type
if not isinstance(inputs, collections.Container) or isinstance(inputs, torch.Tensor):
AttributeError: module 'collections' has no attribute 'Container'
Apparently I had a different PyTorch version before. I think you linked a fork (by carmocca) earlier, but this seems to be the case in the master branch of the original repo as well.
Apologies if this is a stupid question, but shouldn’t I get this version (i.e., the master branch) by installing PyTorch simply with a torch entry in the requirements.txt file in a venv? I think I’ve missed something trivial here…
I just checked again (after uninstalling torch and reinstalling as above) and the condition is still isinstance(inputs, collections.Container).
root@4031504dc1c7:/workspace# python -c "import torch; print(torch.__path__)"
['/opt/conda/lib/python3.8/site-packages/torch']
root@4031504dc1c7:/workspace# sed -n 51p /opt/conda/lib/python3.8/site-packages/torch/cuda/nccl.py
if not isinstance(inputs, collections.abc.Container) or isinstance(inputs, torch.Tensor):
@nullgeppetto I think many people are running into this today because Python 3.10 and PyTorch 1.11 are both very recent releases!
Older torch builds still use collections.Container in torch/cuda/nccl.py, which Python 3.10 no longer provides, so collections.abc.Container should be used instead of collections.Container.
There is no need to change your PyTorch version!
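You can verify this on any interpreter (the version check is only there so the snippet also runs on older Pythons):

```python
import sys
import collections
import collections.abc

# The ABC itself has always lived in collections.abc.
assert isinstance([1, 2, 3], collections.abc.Container)

# The top-level alias collections.Container was deprecated since
# Python 3.3 and removed in Python 3.10, which is exactly what the
# old nccl.py code trips over.
if sys.version_info >= (3, 10):
    assert not hasattr(collections, "Container")
```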
I’m glad it worked out ^^
So, you need to edit lib/python3.10/site-packages/torch/cuda/nccl.py so that the check at line 51 uses collections.abc.Container:
def _check_sequence_type(inputs: Union[torch.Tensor, Sequence[torch.Tensor]]) -> None:
if not isinstance(inputs, collections.abc.Container) or isinstance(inputs, torch.Tensor):
raise TypeError("Inputs should be a collection of tensors")
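Alternatively, if you’d rather not edit files under site-packages, you could restore the removed alias at startup, before the code path that reaches torch.cuda.nccl runs (a workaround sketch, not an official fix):

```python
# Restore the alias that Python 3.10 removed, so the old
# `isinstance(inputs, collections.Container)` check in affected
# PyTorch builds keeps working. Run this before torch.cuda.nccl
# is exercised.
import collections
import collections.abc

if not hasattr(collections, "Container"):
    collections.Container = collections.abc.Container
```

On Python ≤ 3.9 the `if` is simply a no-op, since the alias still exists there.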