I ran into a problem: when I started training with the SmolVLA model, the process aborted. My environment and the full error output are below.
Environment:
Ubuntu: 22.04
Memory: 128 GB
GPU: RTX 5090 48 GB, CUDA 12.8
PyTorch: 2.9.0.dev20250904+cu128
INFO 2025-09-08 14:07:07 ts/train.py:144 Creating optimizer and scheduler
INFO 2025-09-08 14:07:07 ts/train.py:156 Output dir: outputs/train/my_smolvla
INFO 2025-09-08 14:07:07 ts/train.py:159 cfg.steps=20000 (20K)
INFO 2025-09-08 14:07:07 ts/train.py:160 dataset.num_frames=11939 (12K)
INFO 2025-09-08 14:07:07 ts/train.py:161 dataset.num_episodes=50
INFO 2025-09-08 14:07:07 ts/train.py:162 num_learnable_params=99880992 (100M)
INFO 2025-09-08 14:07:07 ts/train.py:163 num_total_params=450046212 (450M)
INFO 2025-09-08 14:07:07 ts/train.py:177 cfg.num_workers=4
INFO 2025-09-08 14:07:07 ts/train.py:178 cfg.batch_size=64
INFO 2025-09-08 14:07:07 ts/train.py:204 Start offline training on a fixed dataset
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/env_isaaclab/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1275, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/ubuntu/miniconda3/envs/env_isaaclab/lib/python3.10/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/home/ubuntu/miniconda3/envs/env_isaaclab/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/ubuntu/miniconda3/envs/env_isaaclab/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 21350) is killed by signal: Aborted.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/ubuntu/lerobot/src/lerobot/scripts/train.py", line 297, in <module>
    main()
  File "/home/ubuntu/lerobot/src/lerobot/scripts/train.py", line 293, in main
    train()
  File "/home/ubuntu/lerobot/src/lerobot/configs/parser.py", line 225, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/ubuntu/lerobot/src/lerobot/scripts/train.py", line 207, in train
    batch = next(dl_iter)
  File "/home/ubuntu/lerobot/src/lerobot/datasets/utils.py", line 630, in cycle
    yield next(iterator)
  File "/home/ubuntu/miniconda3/envs/env_isaaclab/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
    data = self._next_data()
  File "/home/ubuntu/miniconda3/envs/env_isaaclab/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1482, in _next_data
    idx, data = self._get_data()
  File "/home/ubuntu/miniconda3/envs/env_isaaclab/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1434, in _get_data
    success, data = self._try_get_data()
  File "/home/ubuntu/miniconda3/envs/env_isaaclab/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1288, in _try_get_data
    raise RuntimeError(
RuntimeError: DataLoader worker (pid(s) 21350) exited unexpectedly
I also tried num_workers=0, which ended with "Aborted (core dumped)". System memory is not fully used, and GPU resources appear sufficient. How can I locate the cause of this problem and fix it? I look forward to your reply.
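
A possible first diagnostic step, sketched under assumptions: because the abort also happens with num_workers=0, it may originate in native code called from the main process rather than in the DataLoader worker machinery itself. Python's standard faulthandler module can print the Python stack at the moment a fatal signal such as SIGABRT arrives, which may show which call triggers the crash. The helper names run_with_faulthandler and describe_exit below are hypothetical, and the child command is only a stand-in for the real training command.

```python
import signal
import subprocess
import sys


def run_with_faulthandler(args):
    """Run a child Python interpreter with faulthandler enabled.

    `args` are the arguments passed to the interpreter (a script path plus
    its own arguments, or "-c" and a code string).
    Returns (returncode, stderr_text).
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", *args],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative code means the child was killed by that signal.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with code {returncode}"


if __name__ == "__main__":
    # Stand-in for the training script: a child that aborts immediately.
    code, err = run_with_faulthandler(["-c", "import os; os.abort()"])
    print(describe_exit(code))
    # faulthandler writes the Python stack of the crashing thread to stderr.
    print(err)
```

To apply this to a real run, the stand-in arguments would be replaced by the training script path and its usual arguments. Setting the environment variable PYTHONFAULTHANDLER=1 before launching the script has the same effect as the -X faulthandler option.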