Nonfatal AttributeError preceding fatal 'RuntimeError: Pin memory thread exited unexpectedly'

I’m running a repo that requires PyTorch3D, which in turn requires PyTorch. Other answers I’ve seen regarding the pin memory RuntimeError suggest upgrading to PyTorch >= 1.7, which I have done. Before digging into this, I was able to get a successful run of the script in question; however, even on that successful run, the AttributeError was still thrown. Here is my setup:

NVIDIA Driver: 516.94
cudatoolkit: 11.6.0
PyTorch: 1.12.1
PyTorch3D: 0.3.0
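
For reference, the versions above can be confirmed from inside the environment with something like the following (standard torch attributes; pytorch3d exposes __version__ in the releases I've seen):

import torch
import pytorch3d

print("PyTorch:", torch.__version__)                 # 1.12.1
print("CUDA (built against):", torch.version.cuda)   # 11.6
print("cuDNN:", torch.backends.cudnn.version())
print("PyTorch3D:", pytorch3d.__version__)           # 0.3.0
print("CUDA available:", torch.cuda.is_available())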

I tried updating my libraries and running the same script - now it throws the fatal Runtime Error.

0%|                                                | 0/150000 [00:00<?, ?it/s]Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/domattioli/miniconda3/envs/anerf/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/domattioli/miniconda3/envs/anerf/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/domattioli/miniconda3/envs/anerf/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/home/domattioli/miniconda3/envs/anerf/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/home/domattioli/miniconda3/envs/anerf/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 303, in rebuild_storage_fd
    shared_cache[fd_id(fd)] = StorageWeakRef(storage)
  File "/home/domattioli/miniconda3/envs/anerf/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 65, in __setitem__
    self.free_dead_references()
  File "/home/domattioli/miniconda3/envs/anerf/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 70, in free_dead_references
    if storage_ref.expired():
  File "/home/domattioli/miniconda3/envs/anerf/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 35, in expired
    return torch.Storage._expired(self.cdata)  # type: ignore[attr-defined]
  File "/home/domattioli/miniconda3/envs/anerf/lib/python3.8/site-packages/torch/storage.py", line 757, in _expired
    return eval(cls.__module__)._UntypedStorage._expired(*args, **kwargs)
AttributeError: module 'torch.cuda' has no attribute '_UntypedStorage'
/home/domattioli/Projects/HPSE/src/A-NeRF/core/trainer.py:178: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  optim_step = optimizer.state[optimizer.param_groups[0]['params'][0]]['step'] // decay_unit
  0%|                                     | 1/150000 [00:01<57:17:05,  1.37s/it]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File ~/Projects/HPSE/src/A-NeRF/run_nerf.py:630, in <module>
    627 torch.set_default_tensor_type('torch.cuda.FloatTensor')
    628 torch.multiprocessing.set_start_method('spawn')
--> 630 train()

File ~/Projects/HPSE/src/A-NeRF/run_nerf.py:545, in train()
    543 for i in trange(start, N_iters):
    544     time0 = time.time()
--> 545     batch = next(train_iter)
    546     loss_dict, stats = trainer.train_batch(batch, i, global_step)
    548     # Rest is logging

File ~/miniconda3/envs/anerf/lib/python3.8/site-packages/torch/utils/data/dataloader.py:681, in _BaseDataLoaderIter.__next__(self)
    678 if self._sampler_iter is None:
    679     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    680     self._reset()  # type: ignore[call-arg]
--> 681 data = self._next_data()
    682 self._num_yielded += 1
    683 if self._dataset_kind == _DatasetKind.Iterable and \
    684         self._IterableDataset_len_called is not None and \
    685         self._num_yielded > self._IterableDataset_len_called:

File ~/miniconda3/envs/anerf/lib/python3.8/site-packages/torch/utils/data/dataloader.py:1359, in _MultiProcessingDataLoaderIter._next_data(self)
   1356     return self._process_data(data)
   1358 assert not self._shutdown and self._tasks_outstanding > 0
-> 1359 idx, data = self._get_data()
   1360 self._tasks_outstanding -= 1
   1361 if self._dataset_kind == _DatasetKind.Iterable:
   1362     # Check for _IterableDatasetStopIteration

File ~/miniconda3/envs/anerf/lib/python3.8/site-packages/torch/utils/data/dataloader.py:1320, in _MultiProcessingDataLoaderIter._get_data(self)
   1317             return data
   1318     else:
   1319         # while condition is false, i.e., pin_memory_thread died.
-> 1320         raise RuntimeError('Pin memory thread exited unexpectedly')
   1321     # In this case, `self._data_queue` is a `queue.Queue`,. But we don't
   1322     # need to call `.task_done()` because we don't use `.join()`.
   1323 else:
   1324     while True:

RuntimeError: Pin memory thread exited unexpectedly
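
As an aside, the __floordiv__ UserWarning in the log above is unrelated to the crash; following the warning's own suggestion, the offending line in trainer.py could be rewritten roughly as below (a sketch based only on the line shown in the warning, not the actual repo fix):

step = optimizer.state[optimizer.param_groups[0]['params'][0]]['step']
# 'trunc' keeps the current // behavior; use 'floor' for true floor division.
optim_step = torch.div(step, decay_unit, rounding_mode='trunc')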

Are you seeing the same issue when pinned memory is not used or if you set the num_workers to 0?
Also, is this specific to a DDP use case or are you seeing the issue on a single GPU also?
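
For example, testing those two cases should only require changing the DataLoader arguments, along these lines (placeholder dataset and batch size, not the actual A-NeRF loader):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(128, 3))  # placeholder dataset

# Check 1: keep worker processes but disable pinned memory.
loader_no_pin = DataLoader(dataset, batch_size=8, num_workers=4, pin_memory=False)

# Check 2: num_workers=0 loads data in the main process, so no pin-memory thread is spawned.
loader_no_workers = DataLoader(dataset, batch_size=8, num_workers=0, pin_memory=True)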

These are good questions and, admittedly, beyond my novice understanding of GPU utilization. As I understand your questions, are you suggesting that the code isn’t properly set up to use the GPU, i.e. num_workers > 0?

I did downgrade PyTorch to 1.11, and neither error occurs anymore. I am curious if anyone else has encountered it, though.

Hi @ptrblck, I am seeing this error with PyTorch 1.11 when I set pin_memory=True, num_workers=0, and kill and restart the job. The error only occurs when all three conditions are met.

I would recommend updating to the latest stable or nightly release to see if this was a known issue which might already be fixed.

I’m using the Docker PyTorch 1.11 image (pytorch/pytorch:1.11.0-cuda11.3-cudnn8-devel on Docker Hub). This would have the stable build, right?

Also, a correction to the above post: I am seeing this error with PyTorch 1.11 when I set pin_memory=True, num_workers>0, and kill and restart the job. The error only occurs when all three conditions are met.
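
In case it helps with reproduction, the loader setup is roughly the following (placeholder dataset; the relevant parts are pin_memory=True and num_workers>0, with the job killed and relaunched externally):

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(128, 3))  # placeholder dataset
    # Combination reported to trigger the error after a kill + relaunch:
    loader = DataLoader(dataset, batch_size=8, num_workers=4, pin_memory=True)
    for batch in loader:
        pass  # training step would go here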

No, torch==1.11.0 is not the latest stable release, so install 2.0.0 or a nightly.