File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler _error_if_any_worker_fails() RuntimeError: DataLoader worker (pid 348559) is killed by signal: Aborted

I am getting the error below from the torch DataLoader when running test.py. Could you please help me fix it?

(zephyr) mona@ard-gpu-01:~/zephyr/python/zephyr$ python test.py     --model_name pn2     --dataset_root ./data/ycb/matches_data_test/     --dataset_name ycbv     --dataset HSVD_diff_uv_norm     --no_valid_proj --no_valid_depth     --loss_cutoff log     --exp_name final     --resume_path ./ckpts/final_ycbv.ckpt
exp_name: pn2_HSVD_diff_uv_norm_final
args.icp = True
Initializing ycbv dataset from ./data/ycb/matches_data_test/
Using BOP dataset format. Total dataset: 4123
Using PointNet Dataset
Initializating test dataset ['u', 'v', 'H_diff', 'S_diff', 'V_diff', 'D_diff', 'norm_cos']
dim_agg: 0 dim_point: 7
############ BOP test set: 1 ##############
No loss on the best hypotheses
PointNet2: extra_bottleneck_dim = 0
mask: [] xyz: [0, 1] points: [2, 3, 4, 5, 6]
GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
Testing: 0it [00:00, ?it/s]python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
Traceback (most recent call last):
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 346793) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mona/zephyr/python/zephyr/test.py", line 59, in <module>
    main(args)
  File "/home/mona/zephyr/python/zephyr/test.py", line 53, in main
    trainer.test(model, boptest_loader)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1064, in test
    self.fit(model)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 853, in fit
    self.dp_train(model)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 578, in dp_train
    self.run_pretrain_routine(model)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 982, in run_pretrain_routine
    self.run_evaluation(test_mode=True)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 377, in run_evaluation
    eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
    for batch_idx, batch in enumerate(dataloader):
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
    success, data = self._try_get_data()
  File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 346793) exited unexpectedly
Testing:   0%|          | 0/4123 [00:01<?, ?it/s]
(zephyr) mona@ard-gpu-01:~/zephyr/python/zephyr$ python collect_env_details.py 
<details>
  <summary>Current environment</summary>

* CUDA:
	- GPU:
		- NVIDIA GeForce RTX 3080 Laptop GPU
	- available:         True
	- version:           11.7
* Lightning:
	- lightning-cloud:   0.5.36
	- lightning-utilities: 0.8.0
	- pytorch-lightning: 0.7.6
	- torch:             1.13.0+cu117
	- torchmetrics:      0.11.4
	- torchtext:         0.14.0
	- torchvision:       0.14.0+cu117
* Packages:
	- absl-py:           1.4.0
	- addict:            2.4.0
	- aiohttp:           3.8.4
	- aiosignal:         1.3.1
	- anyio:             3.7.0
	- appdirs:           1.4.4
	- arrow:             1.2.3
	- astropy:           5.3
	- asttokens:         2.2.1
	- async-timeout:     4.0.2
	- attrs:             23.1.0
	- backcall:          0.2.0
	- beautifulsoup4:    4.12.2
	- blessed:           1.20.0
	- cachetools:        5.3.1
	- certifi:           2023.5.7
	- charset-normalizer: 3.1.0
	- click:             8.1.3
	- cloudpickle:       2.2.1
	- comm:              0.1.3
	- configargparse:    1.5.3
	- contourpy:         1.0.7
	- croniter:          1.3.15
	- cycler:            0.11.0
	- dash:              2.10.2
	- dash-core-components: 2.0.0
	- dash-html-components: 2.0.0
	- dash-table:        5.0.0
	- dask:              2023.5.1
	- dateutils:         0.6.12
	- debugpy:           1.6.7
	- decorator:         5.1.1
	- deepdiff:          6.3.0
	- exceptiongroup:    1.1.1
	- executing:         1.2.0
	- fastapi:           0.88.0
	- fastjsonschema:    2.17.1
	- flask:             2.2.5
	- fonttools:         4.39.4
	- freetype-py:       2.4.0
	- frozenlist:        1.3.3
	- fsspec:            2023.5.0
	- future:            0.18.3
	- google-auth:       2.19.0
	- google-auth-oauthlib: 1.0.0
	- grpcio:            1.54.2
	- h11:               0.14.0
	- idna:              3.4
	- imageio:           2.30.0
	- importlib-metadata: 6.6.0
	- importlib-resources: 5.12.0
	- inquirer:          3.1.3
	- ipykernel:         6.23.1
	- ipython:           8.13.2
	- ipywidgets:        8.0.6
	- itsdangerous:      2.1.2
	- jedi:              0.18.2
	- jinja2:            3.1.2
	- joblib:            1.2.0
	- jsonschema:        4.17.3
	- jupyter-client:    8.2.0
	- jupyter-core:      5.3.0
	- jupyterlab-widgets: 3.0.7
	- kiwisolver:        1.4.4
	- lazy-loader:       0.2
	- lightning-cloud:   0.5.36
	- lightning-utilities: 0.8.0
	- locket:            1.0.0
	- mako:              1.2.4
	- markdown:          3.4.3
	- markdown-it-py:    2.2.0
	- markupsafe:        2.1.2
	- matplotlib:        3.7.1
	- matplotlib-inline: 0.1.6
	- mdurl:             0.1.2
	- multidict:         6.0.4
	- mvtec-halcon:      23050.0.0
	- nbformat:          5.7.0
	- nest-asyncio:      1.5.6
	- networkx:          3.1
	- numpy:             1.22.3
	- oauthlib:          3.2.2
	- open3d:            0.17.0
	- opencv-python:     4.7.0.72
	- ordered-set:       4.1.0
	- packaging:         23.1
	- pandas:            2.0.2
	- parso:             0.8.3
	- partd:             1.4.0
	- pexpect:           4.8.0
	- pickleshare:       0.7.5
	- pillow:            9.5.0
	- pip:               23.0.1
	- platformdirs:      3.5.1
	- plotly:            5.14.1
	- plyfile:           0.9
	- pooch:             1.7.0
	- prompt-toolkit:    3.0.38
	- protobuf:          4.23.2
	- psutil:            5.9.5
	- ptyprocess:        0.7.0
	- pure-eval:         0.2.2
	- pyamg:             5.0.0
	- pyasn1:            0.5.0
	- pyasn1-modules:    0.3.0
	- pybind11:          2.9.2
	- pybind11-global:   2.10.4
	- pycuda:            2022.2.2
	- pydantic:          1.10.8
	- pyerfa:            2.0.0.3
	- pyglet:            2.0.7
	- pygments:          2.15.1
	- pyjwt:             2.7.0
	- pyopengl:          3.1.0
	- pyparsing:         3.0.9
	- pyquaternion:      0.9.9
	- pyrender:          0.1.45
	- pyrsistent:        0.19.3
	- python-dateutil:   2.8.2
	- python-editor:     1.0.4
	- python-multipart:  0.0.6
	- pytools:           2022.1.14
	- pytorch-lightning: 0.7.6
	- pytz:              2023.3
	- pywavelets:        1.4.1
	- pyyaml:            6.0
	- pyzmq:             25.1.0
	- readchar:          4.0.5
	- requests:          2.31.0
	- requests-oauthlib: 1.3.1
	- rich:              13.4.1
	- rsa:               4.9
	- scikit-image:      0.20.0
	- scikit-learn:      1.2.2
	- scipy:             1.9.1
	- setuptools:        67.8.0
	- simpleitk:         2.2.1
	- six:               1.16.0
	- sniffio:           1.3.0
	- soupsieve:         2.4.1
	- stack-data:        0.6.2
	- starlette:         0.22.0
	- starsessions:      1.3.0
	- tenacity:          8.2.2
	- tensorboard:       2.13.0
	- tensorboard-data-server: 0.7.0
	- threadpoolctl:     3.1.0
	- tifffile:          2023.4.12
	- toolz:             0.12.0
	- torch:             1.13.0+cu117
	- torchmetrics:      0.11.4
	- torchtext:         0.14.0
	- torchvision:       0.14.0+cu117
	- tornado:           6.3.2
	- tqdm:              4.65.0
	- traitlets:         5.9.0
	- trimesh:           3.21.7
	- typing-extensions: 4.6.2
	- tzdata:            2023.3
	- urllib3:           1.26.16
	- uvicorn:           0.22.0
	- wcwidth:           0.2.6
	- websocket-client:  1.5.2
	- websockets:        11.0.3
	- werkzeug:          2.2.3
	- wheel:             0.38.4
	- widgetsnbextension: 4.0.7
	- yarl:              1.9.2
	- zephyr:            0.1.dev0
	- zipp:              3.15.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.9.16
	- release:           5.19.0-42-generic
	- version:           #43~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 21 16:51:08 UTC 2

</details>

and here is the output of torch.utils.collect_env:

(zephyr) mona@ard-gpu-01:~/zephyr/python/zephyr$ python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Clang version: 14.0.0-1ubuntu1
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.9.16 (main, Mar  8 2023, 14:00:05)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.19.0-42-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080 Laptop GPU
Nvidia driver version: 530.30.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] pytorch-lightning==0.7.6
[pip3] torch==1.13.0+cu117
[pip3] torchmetrics==0.11.4
[pip3] torchtext==0.14.0
[pip3] torchvision==0.14.0+cu117
[conda] numpy                     1.24.3                   pypi_0    pypi
[conda] pytorch-lightning         0.7.6                    pypi_0    pypi
[conda] torch                     1.13.0+cu117             pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] torchtext                 0.14.0                   pypi_0    pypi
[conda] torchvision               0.14.0+cu117             pypi_0    pypi

Please let me know if you need more information.

Repo: r-pad/zephyr on GitHub (source code for ZePHyR: Zero-shot Pose Hypothesis Rating, ICRA 2021)


If I set num_workers to 0, the worker error goes away and this is what happens instead:

(zephyr) mona@ard-gpu-01:~/zephyr/python/zephyr$ python test.py     --model_name pn2     --dataset_root ./data/ycb/matches_data_test/     --dataset_name ycbv     --dataset HSVD_diff_uv_norm     --no_valid_proj --no_valid_depth     --loss_cutoff log     --exp_name final     --resume_path ./ckpts/final_ycbv.ckpt --num_workers=0
exp_name: pn2_HSVD_diff_uv_norm_final
args.icp = True
Initializing ycbv dataset from ./data/ycb/matches_data_test/
Using BOP dataset format. Total dataset: 4123
Using PointNet Dataset
Initializating test dataset ['u', 'v', 'H_diff', 'S_diff', 'V_diff', 'D_diff', 'norm_cos']
dim_agg: 0 dim_point: 7
############ BOP test set: 1 ##############
No loss on the best hypotheses
PointNet2: extra_bottleneck_dim = 0
mask: [] xyz: [0, 1] points: [2, 3, 4, 5, 6]
GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:23: UserWarning: The dataloader, test dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Testing: 0it [00:00, ?it/s]python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
Aborted (core dumped)
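
With num_workers=0 the abort happens in the main process, so it can be traced from Python. Below is a minimal sketch for finding which sample triggers it, assuming the dataset object supports `len()` and indexing like a map-style torch Dataset. The helper names `enable_native_tracebacks` and `scan_dataset` are hypothetical and not part of zephyr or torch; only `faulthandler` is a real standard-library module.

```python
import faulthandler


def enable_native_tracebacks():
    """Ask Python to dump the stack of all threads on fatal signals such as SIGABRT.

    Returns False when sys.stderr has no usable file descriptor
    (for example when output is being captured).
    """
    try:
        faulthandler.enable(all_threads=True)
    except (RuntimeError, ValueError, OSError, AttributeError):
        return False
    return True


def scan_dataset(dataset, limit=None, log=print):
    """Load samples one by one in the current process.

    Returns (index, exception) for the first sample that raises a Python
    exception, or None if every sample loads. A native abort cannot be
    caught here, so each index is logged and flushed *before* loading:
    the last line printed identifies the sample that crashed.
    """
    n = len(dataset) if limit is None else min(limit, len(dataset))
    for index in range(n):
        log(f"loading sample {index}", flush=True)
        try:
            dataset[index]
        except Exception as exc:  # Python-level failures only
            return index, exc
    return None
```

Calling `enable_native_tracebacks()` and then `scan_dataset(...)` on the dataset that test.py hands to the DataLoader should print the Python call stack at the moment of the abort. Running the script with `python -X faulthandler test.py ... --num_workers=0` enables the same dump without code changes.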

It seems the problem is related to Eigen, not torch or pytorch_lightning:
Testing: 0it [00:00, ?it/s]python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
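
My reading of this assertion is that an Eigen type with a compile-time dimension of 3 was handed data whose runtime size along that dimension is not 3. The /usr/include/eigen3 path suggests the failing code is an extension compiled locally against the system Eigen. A minimal sketch of a guard on the Python side, assuming (this is only an assumption) that the native function expects an N x 3 float array; `as_nx3` is a hypothetical helper, not a zephyr or Eigen function:

```python
import numpy as np


def as_nx3(array, name="points"):
    """Return `array` as a C-contiguous float64 array of shape (N, 3).

    Raises ValueError in Python instead of letting a fixed-size Eigen
    type abort the whole process on a shape mismatch.
    """
    arr = np.asarray(array, dtype=np.float64)
    if arr.ndim != 2 or arr.shape[1] != 3:
        raise ValueError(f"{name} must have shape (N, 3), got {arr.shape}")
    return np.ascontiguousarray(arr)
```

Wrapping the arguments of each call into compiled code with a check like this would turn the hard abort into an ordinary exception that names the offending array.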

I'll update later once I figure it out.