I got this error when using DataParallel on four GPUs, but the model runs fine on a single GPU. Oddly, the first two batches run fine on the four GPUs, and after that I get this error.
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 652, in __next__
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1330, in _next_data
idx, data = self._get_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1296, in _get_data
success, data = self._try_get_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/opt/conda/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 303, in rebuild_storage_fd
shared_cache[fd_id(fd)] = StorageWeakRef(storage)
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 65, in __setitem__
self.free_dead_references()
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 70, in free_dead_references
if storage_ref.expired():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 35, in expired
return torch.Storage._expired(self.cdata) # type: ignore[attr-defined]
File "/opt/conda/lib/python3.7/site-packages/torch/storage.py", line 753, in _expired
return eval(cls.__module__)._UntypedStorage._expired(*args, **kwargs)
AttributeError: module 'torch.cuda' has no attribute '_UntypedStorage'
I have tried multiple times and the error keeps showing. Thank you for your help.
When using DataParallel to train the model on more than one GPU, this error appears.
I followed PyTorch's recommendation and used DistributedDataParallel instead of DataParallel. Please follow the official tutorials to implement distributed data parallel.
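As a rough illustration of that recommendation, here is a minimal DistributedDataParallel sketch. The model, data, port, and two-process world size are placeholders, not part of the original answer; it uses the `gloo` backend so it runs on CPU, while real multi-GPU training would use `nccl` and move each replica to its own device:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # Each spawned process joins the same process group.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # "gloo" works on CPU; use "nccl" and per-rank devices for real GPU training.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(10, 1)          # placeholder model
    ddp_model = DDP(model)                  # gradients are synchronized across ranks
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    x = torch.randn(8, 10)                  # placeholder batch
    loss = ddp_model(x).sum()
    loss.backward()                         # all-reduce happens here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2                          # one process per GPU in practice
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```

In practice each rank would also wrap its `DataLoader` with a `DistributedSampler` so every process sees a distinct shard of the dataset.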
I don't have an exact solution for this issue, but I will share what I did to work around it.
I got the same error, "AttributeError: module 'torch.cuda' has no attribute '_UntypedStorage'", in Colab while training YolactEdge. The same notebook had been working fine previously. I saw that the PyTorch version was 1.12, so I downgraded it to 1.8 just to give it a try.
I used the following commands:

```
%cd /usr/local
!rm cuda
!ln -s cuda-10.0 cuda
!nvcc --version
```
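For context, this `AttributeError` is associated with the 1.12.0 release, where the `_UntypedStorage` internals were reorganized; later releases no longer hit it. A small helper like the one below can flag the suspect version before training. The function name and the "1.12.0 only" assumption are mine, not from the original answer:

```python
import torch

def is_affected(ver: str) -> bool:
    """Return True for torch versions reported to hit the
    _UntypedStorage AttributeError.
    (Assumption: the bug is specific to the 1.12.0 release;
    local build suffixes like '+cu113' are ignored.)"""
    base = ver.split("+")[0]
    return base == "1.12.0"

# Print the installed version and whether it matches the buggy release.
print(torch.__version__,
      "affected" if is_affected(torch.__version__) else "looks ok")
```

If the check flags your environment, either downgrading (as above) or moving to a later release should make the error go away.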