I am having a problem with the torch DataLoader: a worker is killed by an Eigen assertion failure as soon as testing starts. Could you please help me fix it?
(zephyr) mona@ard-gpu-01:~/zephyr/python/zephyr$ python test.py --model_name pn2 --dataset_root ./data/ycb/matches_data_test/ --dataset_name ycbv --dataset HSVD_diff_uv_norm --no_valid_proj --no_valid_depth --loss_cutoff log --exp_name final --resume_path ./ckpts/final_ycbv.ckpt
exp_name: pn2_HSVD_diff_uv_norm_final
args.icp = True
Initializing ycbv dataset from ./data/ycb/matches_data_test/
Using BOP dataset format. Total dataset: 4123
Using PointNet Dataset
Initializating test dataset ['u', 'v', 'H_diff', 'S_diff', 'V_diff', 'D_diff', 'norm_cos']
dim_agg: 0 dim_point: 7
############ BOP test set: 1 ##############
No loss on the best hypotheses
PointNet2: extra_bottleneck_dim = 0
mask: [] xyz: [0, 1] points: [2, 3, 4, 5, 6]
GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
Testing: 0it [00:00, ?it/s]python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
Traceback (most recent call last):
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/queues.py", line 113, in get
if not self._poll(timeout):
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/selectors.py", line 416, in select
fd_event_list = self._selector.poll(timeout)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 346793) is killed by signal: Aborted.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/mona/zephyr/python/zephyr/test.py", line 59, in <module>
main(args)
File "/home/mona/zephyr/python/zephyr/test.py", line 53, in main
trainer.test(model, boptest_loader)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1064, in test
self.fit(model)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 853, in fit
self.dp_train(model)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 578, in dp_train
self.run_pretrain_routine(model)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 982, in run_pretrain_routine
self.run_evaluation(test_mode=True)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 377, in run_evaluation
eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
for batch_idx, batch in enumerate(dataloader):
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
idx, data = self._get_data()
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
success, data = self._try_get_data()
File "/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 346793) exited unexpectedly
Testing: 0%| | 0/4123 [00:01<?, ?it/s]
(zephyr) mona@ard-gpu-01:~/zephyr/python/zephyr$ python collect_env_details.py
<details>
<summary>Current environment</summary>
* CUDA:
- GPU:
- NVIDIA GeForce RTX 3080 Laptop GPU
- available: True
- version: 11.7
* Lightning:
- lightning-cloud: 0.5.36
- lightning-utilities: 0.8.0
- pytorch-lightning: 0.7.6
- torch: 1.13.0+cu117
- torchmetrics: 0.11.4
- torchtext: 0.14.0
- torchvision: 0.14.0+cu117
* Packages:
- absl-py: 1.4.0
- addict: 2.4.0
- aiohttp: 3.8.4
- aiosignal: 1.3.1
- anyio: 3.7.0
- appdirs: 1.4.4
- arrow: 1.2.3
- astropy: 5.3
- asttokens: 2.2.1
- async-timeout: 4.0.2
- attrs: 23.1.0
- backcall: 0.2.0
- beautifulsoup4: 4.12.2
- blessed: 1.20.0
- cachetools: 5.3.1
- certifi: 2023.5.7
- charset-normalizer: 3.1.0
- click: 8.1.3
- cloudpickle: 2.2.1
- comm: 0.1.3
- configargparse: 1.5.3
- contourpy: 1.0.7
- croniter: 1.3.15
- cycler: 0.11.0
- dash: 2.10.2
- dash-core-components: 2.0.0
- dash-html-components: 2.0.0
- dash-table: 5.0.0
- dask: 2023.5.1
- dateutils: 0.6.12
- debugpy: 1.6.7
- decorator: 5.1.1
- deepdiff: 6.3.0
- exceptiongroup: 1.1.1
- executing: 1.2.0
- fastapi: 0.88.0
- fastjsonschema: 2.17.1
- flask: 2.2.5
- fonttools: 4.39.4
- freetype-py: 2.4.0
- frozenlist: 1.3.3
- fsspec: 2023.5.0
- future: 0.18.3
- google-auth: 2.19.0
- google-auth-oauthlib: 1.0.0
- grpcio: 1.54.2
- h11: 0.14.0
- idna: 3.4
- imageio: 2.30.0
- importlib-metadata: 6.6.0
- importlib-resources: 5.12.0
- inquirer: 3.1.3
- ipykernel: 6.23.1
- ipython: 8.13.2
- ipywidgets: 8.0.6
- itsdangerous: 2.1.2
- jedi: 0.18.2
- jinja2: 3.1.2
- joblib: 1.2.0
- jsonschema: 4.17.3
- jupyter-client: 8.2.0
- jupyter-core: 5.3.0
- jupyterlab-widgets: 3.0.7
- kiwisolver: 1.4.4
- lazy-loader: 0.2
- lightning-cloud: 0.5.36
- lightning-utilities: 0.8.0
- locket: 1.0.0
- mako: 1.2.4
- markdown: 3.4.3
- markdown-it-py: 2.2.0
- markupsafe: 2.1.2
- matplotlib: 3.7.1
- matplotlib-inline: 0.1.6
- mdurl: 0.1.2
- multidict: 6.0.4
- mvtec-halcon: 23050.0.0
- nbformat: 5.7.0
- nest-asyncio: 1.5.6
- networkx: 3.1
- numpy: 1.22.3
- oauthlib: 3.2.2
- open3d: 0.17.0
- opencv-python: 4.7.0.72
- ordered-set: 4.1.0
- packaging: 23.1
- pandas: 2.0.2
- parso: 0.8.3
- partd: 1.4.0
- pexpect: 4.8.0
- pickleshare: 0.7.5
- pillow: 9.5.0
- pip: 23.0.1
- platformdirs: 3.5.1
- plotly: 5.14.1
- plyfile: 0.9
- pooch: 1.7.0
- prompt-toolkit: 3.0.38
- protobuf: 4.23.2
- psutil: 5.9.5
- ptyprocess: 0.7.0
- pure-eval: 0.2.2
- pyamg: 5.0.0
- pyasn1: 0.5.0
- pyasn1-modules: 0.3.0
- pybind11: 2.9.2
- pybind11-global: 2.10.4
- pycuda: 2022.2.2
- pydantic: 1.10.8
- pyerfa: 2.0.0.3
- pyglet: 2.0.7
- pygments: 2.15.1
- pyjwt: 2.7.0
- pyopengl: 3.1.0
- pyparsing: 3.0.9
- pyquaternion: 0.9.9
- pyrender: 0.1.45
- pyrsistent: 0.19.3
- python-dateutil: 2.8.2
- python-editor: 1.0.4
- python-multipart: 0.0.6
- pytools: 2022.1.14
- pytorch-lightning: 0.7.6
- pytz: 2023.3
- pywavelets: 1.4.1
- pyyaml: 6.0
- pyzmq: 25.1.0
- readchar: 4.0.5
- requests: 2.31.0
- requests-oauthlib: 1.3.1
- rich: 13.4.1
- rsa: 4.9
- scikit-image: 0.20.0
- scikit-learn: 1.2.2
- scipy: 1.9.1
- setuptools: 67.8.0
- simpleitk: 2.2.1
- six: 1.16.0
- sniffio: 1.3.0
- soupsieve: 2.4.1
- stack-data: 0.6.2
- starlette: 0.22.0
- starsessions: 1.3.0
- tenacity: 8.2.2
- tensorboard: 2.13.0
- tensorboard-data-server: 0.7.0
- threadpoolctl: 3.1.0
- tifffile: 2023.4.12
- toolz: 0.12.0
- torch: 1.13.0+cu117
- torchmetrics: 0.11.4
- torchtext: 0.14.0
- torchvision: 0.14.0+cu117
- tornado: 6.3.2
- tqdm: 4.65.0
- traitlets: 5.9.0
- trimesh: 3.21.7
- typing-extensions: 4.6.2
- tzdata: 2023.3
- urllib3: 1.26.16
- uvicorn: 0.22.0
- wcwidth: 0.2.6
- websocket-client: 1.5.2
- websockets: 11.0.3
- werkzeug: 2.2.3
- wheel: 0.38.4
- widgetsnbextension: 4.0.7
- yarl: 1.9.2
- zephyr: 0.1.dev0
- zipp: 3.15.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.9.16
- release: 5.19.0-42-generic
- version: #43~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 21 16:51:08 UTC 2
</details>
and
(zephyr) mona@ard-gpu-01:~/zephyr/python/zephyr$ python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.13.0+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
Clang version: 14.0.0-1ubuntu1
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.9.16 (main, Mar 8 2023, 14:00:05) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.19.0-42-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080 Laptop GPU
Nvidia driver version: 530.30.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] pytorch-lightning==0.7.6
[pip3] torch==1.13.0+cu117
[pip3] torchmetrics==0.11.4
[pip3] torchtext==0.14.0
[pip3] torchvision==0.14.0+cu117
[conda] numpy 1.24.3 pypi_0 pypi
[conda] pytorch-lightning 0.7.6 pypi_0 pypi
[conda] torch 1.13.0+cu117 pypi_0 pypi
[conda] torchmetrics 0.11.4 pypi_0 pypi
[conda] torchtext 0.14.0 pypi_0 pypi
[conda] torchvision 0.14.0+cu117 pypi_0 pypi
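The two reports above disagree on numpy (1.22.3 in the first, 1.24.3 in the second). Below is a minimal sketch for checking which copy the interpreter in this environment actually resolves. The helper names `installed_version` and `module_origin` are made up for illustration, and only the standard library is used:

```python
import importlib.util
from importlib import metadata


def installed_version(dist_name):
    """Version recorded in the installed package metadata, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def module_origin(module_name):
    """File the interpreter would import for module_name, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


# Print the metadata version and import location for the packages of interest.
for name in ("numpy", "torch"):
    print(name, installed_version(name), module_origin(name))
```

Running this with the same `python` that runs test.py should show whether a single numpy install is being picked up.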
Please let me know if more information is needed.
Repo: r-pad/zephyr on GitHub (source code for ZePHyR: Zero-shot Pose Hypothesis Rating, ICRA 2021)
If I set num_workers to 0, this is what happens:
(zephyr) mona@ard-gpu-01:~/zephyr/python/zephyr$ python test.py --model_name pn2 --dataset_root ./data/ycb/matches_data_test/ --dataset_name ycbv --dataset HSVD_diff_uv_norm --no_valid_proj --no_valid_depth --loss_cutoff log --exp_name final --resume_path ./ckpts/final_ycbv.ckpt --num_workers=0
exp_name: pn2_HSVD_diff_uv_norm_final
args.icp = True
Initializing ycbv dataset from ./data/ycb/matches_data_test/
Using BOP dataset format. Total dataset: 4123
Using PointNet Dataset
Initializating test dataset ['u', 'v', 'H_diff', 'S_diff', 'V_diff', 'D_diff', 'norm_cos']
dim_agg: 0 dim_point: 7
############ BOP test set: 1 ##############
No loss on the best hypotheses
PointNet2: extra_bottleneck_dim = 0
mask: [] xyz: [0, 1] points: [2, 3, 4, 5, 6]
GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
/home/mona/anaconda3/envs/zephyr/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:23: UserWarning: The dataloader, test dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Testing: 0it [00:00, ?it/s]python3: /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:133: Eigen::internal::variable_if_dynamic<T, Value>::variable_if_dynamic(T) [with T = long int; int Value = 3]: Assertion `v == T(Value)' failed.
Aborted (core dumped)
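Since the run with num_workers set to 0 aborts in the main process, the failing sample could be narrowed down by indexing the dataset directly, without the DataLoader or Lightning in between. This is a minimal sketch, assuming the assertion is triggered by native code called while a single item is loaded. `probe_dataset` is a hypothetical helper, and the toy list stands in for the real dataset object built in test.py:

```python
import sys


def probe_dataset(dataset, limit=None, log=sys.stderr):
    """Load items one at a time and return how many were loaded.

    The index is printed and flushed before each access, so if a native
    assertion aborts the interpreter, the last index printed identifies
    the sample that triggered it.
    """
    n = len(dataset) if limit is None else min(limit, len(dataset))
    for i in range(n):
        print(f"loading sample {i}", file=log, flush=True)
        _ = dataset[i]
    return n


# Stand-in data for illustration only; replace with the real dataset object.
toy_dataset = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
loaded = probe_dataset(toy_dataset)
print(loaded)
```

Starting the script as `python -X faulthandler ...` additionally makes Python dump its own stack when the process receives SIGABRT, which should show the Python call that entered the Eigen code.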