Hi.
I get a segmentation fault while training on 4 GPUs.
The PyTorch version is 1.10.0:
```
>>> import torch
>>> print(torch.__version__)
1.10.0a0+3fd9dcf
```
The CUDA version is 11.4:
```
>>> print(torch.version.cuda)
11.4
```
The cuDNN version is 8.2.2:
```
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 2
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#endif /* CUDNN_VERSION_H */
```
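In case it is easier to compare environments, the same three versions can probably be collected in one go from Python. This is only a minimal sketch: `collect_versions` is a hypothetical helper name, and it assumes the cuDNN version that PyTorch reports is the one of interest (it may differ from the header in `/usr/include`).

```python
import importlib


def collect_versions():
    """Return the PyTorch, CUDA and cuDNN versions, or None where unavailable."""
    info = {"torch": None, "cuda": None, "cudnn": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this environment.
        return info
    info["torch"] = torch.__version__
    # CUDA version PyTorch was built with; None for CPU-only builds.
    info["cuda"] = torch.version.cuda
    # cuDNN version as an integer; None if cuDNN is unavailable.
    info["cudnn"] = torch.backends.cudnn.version()
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Run inside the training container, this prints one line per component, which can be pasted into a report as-is.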
I ran Python under gdb:
(gdb) run -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train.py --workers 4 --device 0,1,2,3 --sync-bn --batch-size 16 --data data/boards.yaml --img-size 1280 1280 --cfg cfg/training/yolov7-boards.yaml --resume /mnt/data2/train_codes/yolov7/runs/train/yolov7-72-boards42/weights/last.pt --name yolov7-72-boards --hyp data/hyp.scratch.p5.yaml --epoch 400
The output is:
train: Scanning '/mnt/data2/dataset/dataset-boards-2023.8.17/labels/train.cache' images and labels... 5665 found, 0 missing, 17 empty, 0 corrupted: 100%|█| 5665/566
train: Caching images (2.3GB): 16%|███████████████▎ | 930/5665 [00:17<01:11, 66.57it/s]Corrupt JPEG data: 463 extraneous bytes before marker 0xd9
train: Caching images (2.5GB): 18%|████████████████▎ | 1003/5665 [00:18<01:13, 63.56it/s]Corrupt JPEG data: 723 extraneous bytes before marker 0xd9
Then it crashes with:
[7a78c0a56ca5:7035 :0:7035] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1755558ef4c8)
==== backtrace (tid: 7035) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7fff22b9a824]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x2b9ff) [0x7fff22b9a9ff]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x2bd34) [0x7fff22b9ad34]
3 /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x7e4) [0x555555723e74]
4 /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
5 /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
6 /opt/conda/bin/python3.8(PyObject_Call+0x5e) [0x555555675b6e]
7 /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x21bf) [0x55555572584f]
8 /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
9 /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
10 /opt/conda/bin/python3.8(PyObject_Call+0x5e) [0x555555675b6e]
11 /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x21bf) [0x55555572584f]
12 /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
13 /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
14 /opt/conda/bin/python3.8(+0x1b7ed1) [0x55555570bed1]
15 /opt/conda/bin/python3.8(PyObject_Call+0x5e) [0x555555675b6e]
16 /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x21bf) [0x55555572584f]
17 /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
18 /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
19 /opt/conda/bin/python3.8(PyObject_Call+0x5e) [0x555555675b6e]
20 /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x21bf) [0x55555572584f]
21 /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
22 /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
23 /opt/conda/bin/python3.8(+0x1b7ed1) [0x55555570bed1]
24 /opt/conda/bin/python3.8(PyObject_Call+0x5e) [0x555555675b6e]
25 /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x21bf) [0x55555572584f]
26 /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
27 /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
28 /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0xa63) [0x5555557240f3]
29 /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
30 /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
31 /opt/conda/bin/python3.8(PyObject_Call+0x5e) [0x555555675b6e]
32 /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x21bf) [0x55555572584f]
33 /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
34 /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
35 /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x947) [0x555555723fd7]
36 /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
37 /opt/conda/bin/python3.8(PyEval_EvalCodeEx+0x39) [0x55555570b499]
38 /opt/conda/bin/python3.8(PyEval_EvalCode+0x1b) [0x5555557a6ecb]
39 /opt/conda/bin/python3.8(+0x252f63) [0x5555557a6f63]
40 /opt/conda/bin/python3.8(+0x26f033) [0x5555557c3033]
41 /opt/conda/bin/python3.8(+0x274022) [0x5555557c8022]
42 /opt/conda/bin/python3.8(PyRun_SimpleFileExFlags+0x1b2) [0x5555557c8202]
43 /opt/conda/bin/python3.8(Py_RunMain+0x36d) [0x5555557c877d]
44 /opt/conda/bin/python3.8(Py_BytesMain+0x39) [0x5555557c8939]
45 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7ffff7c68083]
46 /opt/conda/bin/python3.8(+0x1e8f39) [0x55555573cf39]
=================================
Fatal Python error: Segmentation fault
Thread 0x00007ffe8e8d5700 (most recent call first):
<no Python frame>
Thread 0x00007ffe6dfff700 (most recent call first):
<no Python frame>
Thread 0x00007ffea8ffd700 (most recent call first):
<no Python frame>
Thread 0x00007ffe6cffd700 (most recent call first):
<no Python frame>
Thread 0x00007fff22972700 (most recent call first):
File "/opt/conda/lib/python3.8/threading.py", line 306 in wait
File "/opt/conda/lib/python3.8/threading.py", line 558 in wait
File "/opt/conda/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
File "/opt/conda/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/opt/conda/lib/python3.8/threading.py", line 890 in _bootstrap
Current thread 0x00007ffff7c3f740 (most recent call first):
File "/opt/conda/lib/python3.8/site-packages/torch/optim/sgd.py", line 105 in step
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28 in decorate_context
File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88 in wrapper
File "/opt/conda/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65 in wrapper
File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 285 in _maybe_opt_step
File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 338 in step
File "train.py", line 383 in train
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351 in wrapper
File "train.py", line 624 in <module>
52/399 17.9G 0.01479 0.00392 0.0007133 0.01942 14 1280: 43%|██████████████████▌ | 153/355 [01:40<02:15, 1.49it/s]ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 2 (pid: 7035) of binary: /opt/conda/bin/python3.8
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 187, in main
launch(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 173, in launch
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 688, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
*************************************************
train.py FAILED
=================================================
Root Cause:
[0]:
time: 2023-09-13_17:28:43
rank: 2 (local_rank: 2)
exitcode: -11 (pid: 7035)
error_file: <N/A>
msg: "Signal 11 (SIGSEGV) received by PID 7035"
=================================================
Other Failures:
<NO_OTHER_FAILURES>
*************************************************
[Thread 0x7ffcd79d9700 (LWP 7032) exited]
[Thread 0x7ffcd81da700 (LWP 7031) exited]
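For reference, the `exitcode: -11` in the report above follows the usual convention that a negative exit code -N means the process was killed by signal N, which matches the `Signal 11 (SIGSEGV)` message. A minimal sketch of that mapping, using only Python's standard `signal` module (`describe_exit` is a hypothetical helper name, not part of PyTorch):

```python
import signal


def describe_exit(exitcode):
    """Turn a subprocess-style exit code into a short readable description."""
    if exitcode < 0:
        # Negative codes mean "terminated by signal N", where N = -exitcode.
        try:
            return signal.Signals(-exitcode).name
        except ValueError:
            return f"unknown signal {-exitcode}"
    return f"exit status {exitcode}"


print(describe_exit(-11))  # SIGSEGV on Linux
```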
I’m struggling to solve this problem. Please help me.