Segmentation fault while training yolov7

Hi.
I am hitting a segmentation fault while training with 4 GPUs.

The PyTorch version is 1.10.0:

```python
>>> import torch
>>> print(torch.__version__)
1.10.0a0+3fd9dcf
```

The CUDA version is 11.4:

```python
>>> print(torch.version.cuda)
11.4
```

The cuDNN version is 8.2.2:

```shell
$ cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 2

#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
#endif /* CUDNN_VERSION_H */
```
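The `CUDNN_VERSION` macro at the bottom of that header packs the three numbers into a single integer. As a small sanity check, the same arithmetic can be reproduced by parsing the header text (a sketch; only the macro names shown above are assumed):

```python
import re

HEADER = """\
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 2
"""

def cudnn_version(header_text):
    """Extract the version macros and apply the CUDNN_VERSION formula:
    MAJOR * 1000 + MINOR * 100 + PATCHLEVEL."""
    vals = {name: int(re.search(rf"#define CUDNN_{name}\s+(\d+)", header_text).group(1))
            for name in ("MAJOR", "MINOR", "PATCHLEVEL")}
    return vals["MAJOR"] * 1000 + vals["MINOR"] * 100 + vals["PATCHLEVEL"]

print(cudnn_version(HEADER))  # 8202, i.e. cuDNN 8.2.2
```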

I ran the training under gdb:

```shell
(gdb) run -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train.py --workers 4 --device 0,1,2,3 --sync-bn --batch-size 16 --data data/boards.yaml --img-size 1280 1280 --cfg cfg/training/yolov7-boards.yaml --resume /mnt/data2/train_codes/yolov7/runs/train/yolov7-72-boards42/weights/last.pt --name yolov7-72-boards --hyp data/hyp.scratch.p5.yaml --epoch 400
```
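One caveat with this gdb setup (my assumption, not something shown in the logs): `torch.distributed.launch` forks one worker process per GPU, and a plain `run` only debugs the launcher, so a segfault in a worker rank dies outside gdb's control. Telling gdb to follow children should let it stop at the faulting instruction, along these lines:

```shell
# Sketch: follow the forked workers so gdb catches the SIGSEGV inside the
# crashing child process instead of only seeing the launcher's exit.
gdb -ex "set follow-fork-mode child" \
    -ex "set detach-on-fork off" \
    --args python -m torch.distributed.launch --nproc_per_node 4 train.py
```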

The training output is:

```
train: Scanning ‘/mnt/data2/dataset/dataset-boards-2023.8.17/labels/train.cache’ images and labels… 5665 found, 0 missing, 17 empty, 0 corrupted: 100%|█| 5665/566
train: Caching images (2.3GB): 16%|███████████████▎ | 930/5665 [00:17<01:11, 66.57it/s]Corrupt JPEG data: 463 extraneous bytes before marker 0xd9
train: Caching images (2.5GB): 18%|████████████████▎ | 1003/5665 [00:18<01:13, 63.56it/s]Corrupt JPEG data: 723 extraneous bytes before marker 0xd9
```
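The `Corrupt JPEG data: … extraneous bytes before marker 0xd9` lines are libjpeg warnings, not errors: a few files carry stray bytes before the end-of-image marker but still decode. A common way to silence them is to re-encode the affected files; a minimal sketch using Pillow (assuming Pillow is installed, which the yolov7 data loader already requires — the dataset path below is hypothetical):

```python
from pathlib import Path
from PIL import Image

def resave_jpeg(path):
    """Decode a JPEG fully and re-encode it, dropping the stray bytes
    that trigger libjpeg's 'extraneous bytes before marker 0xd9' warning."""
    with Image.open(path) as im:
        rgb = im.convert("RGB")  # forces a full decode into memory
    rgb.save(path, "JPEG", quality=95)

root = Path("/mnt/data2/dataset")  # hypothetical dataset location
if root.is_dir():
    for p in root.rglob("*.jpg"):
        resave_jpeg(p)
```

Note that re-encoding is lossy, so a high `quality` setting is used; the warnings themselves are almost certainly unrelated to the segfault.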

The crash output is:

```
[7a78c0a56ca5:7035 :0:7035] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1755558ef4c8)
==== backtrace (tid:   7035) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7fff22b9a824]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x2b9ff) [0x7fff22b9a9ff]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x2bd34) [0x7fff22b9ad34]
 3  /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x7e4) [0x555555723e74]
 4  /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
 5  /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
 6  /opt/conda/bin/python3.8(PyObject_Call+0x5e) [0x555555675b6e]
 7  /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x21bf) [0x55555572584f]
 8  /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
 9  /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
10  /opt/conda/bin/python3.8(PyObject_Call+0x5e) [0x555555675b6e]
11  /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x21bf) [0x55555572584f]
12  /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
13  /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
14  /opt/conda/bin/python3.8(+0x1b7ed1) [0x55555570bed1]
15  /opt/conda/bin/python3.8(PyObject_Call+0x5e) [0x555555675b6e]
16  /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x21bf) [0x55555572584f]
17  /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
18  /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
19  /opt/conda/bin/python3.8(PyObject_Call+0x5e) [0x555555675b6e]
20  /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x21bf) [0x55555572584f]
21  /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
22  /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
23  /opt/conda/bin/python3.8(+0x1b7ed1) [0x55555570bed1]
24  /opt/conda/bin/python3.8(PyObject_Call+0x5e) [0x555555675b6e]
25  /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x21bf) [0x55555572584f]
26  /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
27  /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
28  /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0xa63) [0x5555557240f3]
29  /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
30  /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
31  /opt/conda/bin/python3.8(PyObject_Call+0x5e) [0x555555675b6e]
32  /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x21bf) [0x55555572584f]
33  /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
34  /opt/conda/bin/python3.8(_PyFunction_Vectorcall+0x378) [0x55555570b818]
35  /opt/conda/bin/python3.8(_PyEval_EvalFrameDefault+0x947) [0x555555723fd7]
36  /opt/conda/bin/python3.8(_PyEval_EvalCodeWithName+0x2c3) [0x55555570a433]
37  /opt/conda/bin/python3.8(PyEval_EvalCodeEx+0x39) [0x55555570b499]
38  /opt/conda/bin/python3.8(PyEval_EvalCode+0x1b) [0x5555557a6ecb]
39  /opt/conda/bin/python3.8(+0x252f63) [0x5555557a6f63]
40  /opt/conda/bin/python3.8(+0x26f033) [0x5555557c3033]
41  /opt/conda/bin/python3.8(+0x274022) [0x5555557c8022]
42  /opt/conda/bin/python3.8(PyRun_SimpleFileExFlags+0x1b2) [0x5555557c8202]
43  /opt/conda/bin/python3.8(Py_RunMain+0x36d) [0x5555557c877d]
44  /opt/conda/bin/python3.8(Py_BytesMain+0x39) [0x5555557c8939]
45  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7ffff7c68083]
46  /opt/conda/bin/python3.8(+0x1e8f39) [0x55555573cf39]
=================================
Fatal Python error: Segmentation fault

Thread 0x00007ffe8e8d5700 (most recent call first):
<no Python frame>

Thread 0x00007ffe6dfff700 (most recent call first):
<no Python frame>

Thread 0x00007ffea8ffd700 (most recent call first):
<no Python frame>

Thread 0x00007ffe6cffd700 (most recent call first):
<no Python frame>

Thread 0x00007fff22972700 (most recent call first):
  File "/opt/conda/lib/python3.8/threading.py", line 306 in wait
  File "/opt/conda/lib/python3.8/threading.py", line 558 in wait
  File "/opt/conda/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
  File "/opt/conda/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/opt/conda/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00007ffff7c3f740 (most recent call first):
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/sgd.py", line 105 in step
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28 in decorate_context
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88 in wrapper
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65 in wrapper
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 285 in _maybe_opt_step
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 338 in step
  File "train.py", line 383 in train
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351 in wrapper
  File "train.py", line 624 in <module>
    52/399     17.9G   0.01479   0.00392 0.0007133   0.01942        14      1280:  43%|██████████████████▌                        | 153/355 [01:40<02:15,  1.49it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 2 (pid: 7035) of binary: /opt/conda/bin/python3.8
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 187, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 173, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 688, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
*************************************************
                 train.py FAILED                 
=================================================
Root Cause:
[0]:
  time: 2023-09-13_17:28:43
  rank: 2 (local_rank: 2)
  exitcode: -11 (pid: 7035)
  error_file: <N/A>
  msg: "Signal 11 (SIGSEGV) received by PID 7035"
=================================================
Other Failures:
  <NO_OTHER_FAILURES>
*************************************************

[Thread 0x7ffcd79d9700 (LWP 7032) exited]
[Thread 0x7ffcd81da700 (LWP 7031) exited]
```
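As an aside on reading the elastic report: `exitcode: -11` is the usual POSIX convention of reporting a signal-killed child as the negated signal number, and 11 is SIGSEGV. A tiny sketch of that convention:

```python
import signal
import subprocess
import sys

# Spawn a child that kills itself with SIGSEGV, mimicking the crashed worker;
# the parent then sees the negated signal number as the return code.
proc = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
)
print(proc.returncode)  # -11 on Linux (== -signal.SIGSEGV)
```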

I’m struggling to solve this problem. Please help me.

Hello @xinlin-xiao, have you found a solution yet? I have encountered the same problem.

Sorry, I was not able to fix it. I just switched to another server to avoid the error.