Training with Pytorch DDP: DataLoader worker (pid 2945877) is killed by signal: Segmentation fault

Anita · March 1, 2023, 10:32am

Hi,

I have implemented PyTorch DDP training for image classification, based on the official example:

pytorch/examples/blob/main/imagenet/main.py

import argparse
import os
import random
import shutil
import time
import warnings
from enum import Enum

import torch
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.nn.parallel
import torch.optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision.datasets as datasets
import torchvision.models as models
import torchvision.transforms as transforms


Training is crashing with RuntimeError: DataLoader worker (pid 2273997) is killed by signal: Segmentation fault. The error comes up at random epochs whenever I use num_workers>0. When monitoring the CPU, the memory limit is not even being exceeded…

Things I have considered so far and that did not work:
Lowered batch size
Increased Docker shared memory
Updated PyTorch version
Verified that there is no corrupted file
I am using a custom dataset, and I made sure it does not use any Python lists; everything is a NumPy array.

I then tried changing the DataLoader's prefetch_factor parameter to prefetch_factor=1, and now the code seems to be running without issues with num_workers>0.
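For anyone who wants to try the same workaround, here is a minimal sketch of how the DataLoader arguments could be assembled. The helper name `make_loader_kwargs` is hypothetical and not part of PyTorch. It assumes, as in the PyTorch versions I am aware of, that `prefetch_factor` may only be passed when `num_workers > 0`, and that each worker keeps `prefetch_factor` batches in flight, so lowering it reduces the number of batches held in shared memory at once.

```python
from typing import Any, Dict, Optional


def make_loader_kwargs(batch_size: int, num_workers: int,
                       prefetch_factor: Optional[int] = 1) -> Dict[str, Any]:
    """Build keyword arguments intended for torch.utils.data.DataLoader.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    kwargs: Dict[str, Any] = {
        "batch_size": batch_size,
        "num_workers": num_workers,
    }
    # prefetch_factor only applies to worker processes, so it is left out
    # when loading happens in the main process (num_workers == 0).
    if num_workers > 0 and prefetch_factor is not None:
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs


if __name__ == "__main__":
    print(make_loader_kwargs(batch_size=32, num_workers=4))
    # With PyTorch installed, the result would be used roughly like this:
    # loader = DataLoader(dataset, sampler=sampler, **make_loader_kwargs(32, 4))
```

Note that this only describes the workaround; it does not explain why a larger prefetch_factor triggers the segfault.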

Can someone please help me find out why this might be happening?

Thank you in advance.

ptrblck · March 1, 2023, 10:37am

It’s a bit hard to speculate why the segfault is raised. In case you can reproduce it in a reasonable amount of time, would it be possible to get a stacktrace from gdb via:

gdb --args python script.py args
...
run
...
bt

Anita · March 1, 2023, 11:06am

Thank you for your prompt answer @ptrblck. I am wondering whether I can use gdb with torchrun?
It shows the following error:

root@6ef594d11ca1:/home/nitmul/projects/meb-prostate-ai# gdb python
GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...(no debugging symbols found)...done.
(gdb) run torchrun --nnodes=1 --nproc_per_node=3 prostateai/gleasonai/model_2/main.py
Starting program: /usr/local/bin/python torchrun --nnodes=1 --nproc_per_node=3 prostateai/gleasonai/model_2/main.py
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
/usr/local/bin/python: can't open file 'torchrun': [Errno 2] No such file or directory
[Inferior 1 (process 36691) exited with code 02]
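The error above appears to come from gdb starting `python` with `torchrun` as the script argument, which Python then looks for as a file in the current directory. One possible way around it, sketched below, is to invoke the module that the `torchrun` console script wraps (`torch.distributed.run`) with `-m`. The helper name `gdb_torchrun_command` is hypothetical; the snippet only builds and prints the command line, it does not start gdb.

```python
import shlex
import sys
from typing import List


def gdb_torchrun_command(script: str, nproc_per_node: int, nnodes: int = 1) -> List[str]:
    """Return an argv list that runs the torchrun launcher under gdb.

    Hypothetical helper for illustration. It relies on torchrun being
    equivalent to `python -m torch.distributed.run`.
    """
    return [
        "gdb", "--args", sys.executable, "-m", "torch.distributed.run",
        f"--nnodes={nnodes}", f"--nproc_per_node={nproc_per_node}", script,
    ]


if __name__ == "__main__":
    cmd = gdb_torchrun_command("prostateai/gleasonai/model_2/main.py", 3)
    # Print a shell-safe command line that can be pasted into a terminal.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Keep in mind that gdb started this way debugs the launcher process; the training ranks and their DataLoader workers are separate child processes.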

Anita · March 1, 2023, 2:07pm

Hi @ptrblck. I did what you suggested, and the output from the backtrace (bt) is as follows:


(gdb) bt
#0  0x00007f0c82e32d1f in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0, timeout=0x7ffd3112e850) at ../sysdeps/unix/sysv/linux/select.c:41
#1  0x00000000005b97c4 in pysleep (secs=<optimized out>) at ../Modules/timemodule.c:1467
#2  time_sleep () at ../Modules/timemodule.c:235
#3  0x00000000005075bc in _PyCFunction_FastCallDict (kwargs=<optimized out>, nargs=<optimized out>, args=0x513c850, func_obj=<built-in method sleep of module object at remote 0x7f0c81a66ef8>) at ../Objects/methodobject.c:209
#4  _PyCFunction_FastCallKeywords (kwnames=<optimized out>, nargs=<optimized out>, stack=<optimized out>, func=<optimized out>) at ../Objects/methodobject.c:294
#5  call_function.lto_priv () at ../Python/ceval.c:4851
#6  0x00000000005092a4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#7  0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x513c688, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/agent/server/api.py, line 843, in _invoke_run (self=<LocalElasticAgent(_worker_group=<WorkerGroup at remote 0x7f0bb1ec5f10>, _remaining_restarts=0, _store=<torch._C._distributed_c10d.PrefixStore at remote 0x7f0bb1eec340>, _exit_barrier_timeout=300, _total_execution_time=0, _start_method='spawn', _pcontext=<SubprocessContext(name='default', entrypoint='/usr/bin/python3', args={0: ('-u', 'prostateai/gleasonai/model_2/main.py'), 1: ('-u', 'prostateai/gleasonai/model_2/main.py'), 2: ('-u', 'prostateai/gleasonai/model_2/main.py')}, envs={0: {'LOCAL_RANK': '0', 'RANK': '0', 'GROUP_RANK': '0', 'ROLE_RANK': '0', 'ROLE_NAME': 'default', 'LOCAL_WORLD_SIZE': '3', 'WORLD_SIZE': '3', 'GROUP_WORLD_SIZE': '1', 'ROLE_WORLD_SIZE': '3', 'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'TORCHELASTIC_RESTART_COUNT': '0', 'TORCHELASTIC_MAX_RESTARTS': '0', 'TORCHELASTIC_RUN_ID': 'none', 'TORCHELASTIC_USE_AGENT_STORE': 'True', 'NCCL_ASYNC_...(truncated)) at ../Python/ceval.c:754
#8  _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#9  0x0000000000506d00 in fast_function.lto_priv () at ../Python/ceval.c:4992
#10 0x00000000005076ed in call_function.lto_priv () at ../Python/ceval.c:4872
#11 0x00000000005092a4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#12 0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x50ecce8, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/agent/server/api.py, line 709, in run (self=<LocalElasticAgent(_worker_group=<WorkerGroup at remote 0x7f0bb1ec5f10>, _remaining_restarts=0, _store=<torch._C._distributed_c10d.PrefixStore at remote 0x7f0bb1eec340>, _exit_barrier_timeout=300, _total_execution_time=0, _start_method='spawn', _pcontext=<SubprocessContext(name='default', entrypoint='/usr/bin/python3', args={0: ('-u', 'prostateai/gleasonai/model_2/main.py'), 1: ('-u', 'prostateai/gleasonai/model_2/main.py'), 2: ('-u', 'prostateai/gleasonai/model_2/main.py')}, envs={0: {'LOCAL_RANK': '0', 'RANK': '0', 'GROUP_RANK': '0', 'ROLE_RANK': '0', 'ROLE_NAME': 'default', 'LOCAL_WORLD_SIZE': '3', 'WORLD_SIZE': '3', 'GROUP_WORLD_SIZE': '1', 'ROLE_WORLD_SIZE': '3', 'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'TORCHELASTIC_RESTART_COUNT': '0', 'TORCHELASTIC_MAX_RESTARTS': '0', 'TORCHELASTIC_RUN_ID': 'none', 'TORCHELASTIC_USE_AGENT_STORE': 'True', 'NCCL_ASYNC_ERROR_HA...(truncated)) at ../Python/ceval.c:754
#13 _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#14 0x0000000000586641 in PyEval_EvalCodeEx (closure=<optimized out>, kwdefs=<optimized out>, defcount=1, defs=0x7f0b46271fe8, kwcount=0, kws=0x7f0c8313a060, argcount=<optimized out>, args=0x7f0bb1f2db18, locals=0x0, globals=<optimized out>, _co=<optimized out>) at ../Python/ceval.c:4187
#15 function_call.lto_priv () at ../Objects/funcobject.c:604
#16 0x000000000059d65e in PyObject_Call () at ../Objects/abstract.c:2261
#17 0x000000000050a752 in do_call_core (kwdict={}, 
    callargs=(<LocalElasticAgent(_worker_group=<WorkerGroup at remote 0x7f0bb1ec5f10>, _remaining_restarts=0, _store=<torch._C._distributed_c10d.PrefixStore at remote 0x7f0bb1eec340>, _exit_barrier_timeout=300, _total_execution_time=0, _start_method='spawn', _pcontext=<SubprocessContext(name='default', entrypoint='/usr/bin/python3', args={0: ('-u', 'prostateai/gleasonai/model_2/main.py'), 1: ('-u', 'prostateai/gleasonai/model_2/main.py'), 2: ('-u', 'prostateai/gleasonai/model_2/main.py')}, envs={0: {'LOCAL_RANK': '0', 'RANK': '0', 'GROUP_RANK': '0', 'ROLE_RANK': '0', 'ROLE_NAME': 'default', 'LOCAL_WORLD_SIZE': '3', 'WORLD_SIZE': '3', 'GROUP_WORLD_SIZE': '1', 'ROLE_WORLD_SIZE': '3', 'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'TORCHELASTIC_RESTART_COUNT': '0', 'TORCHELASTIC_MAX_RESTARTS': '0', 'TORCHELASTIC_RUN_ID': 'none', 'TORCHELASTIC_USE_AGENT_STORE': 'True', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'OMP_NUM_THREADS': '1', 'TORCHELASTIC_ERROR_FILE': '/tmp/torchelastic_w0i_nkib/none_v9f_c8k6/attempt_0/0/error.json'}, 1...(truncated), func=<function at remote 0x7f0bb1edac80>) at ../Python/ceval.c:5120
#18 _PyEval_EvalFrameDefault () at ../Python/ceval.c:3404
#19 0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x50b84f8, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/metrics/api.py, line 125, in wrapper (args=(<LocalElasticAgent(_worker_group=<WorkerGroup at remote 0x7f0bb1ec5f10>, _remaining_restarts=0, _store=<torch._C._distributed_c10d.PrefixStore at remote 0x7f0bb1eec340>, _exit_barrier_timeout=300, _total_execution_time=0, _start_method='spawn', _pcontext=<SubprocessContext(name='default', entrypoint='/usr/bin/python3', args={0: ('-u', 'prostateai/gleasonai/model_2/main.py'), 1: ('-u', 'prostateai/gleasonai/model_2/main.py'), 2: ('-u', 'prostateai/gleasonai/model_2/main.py')}, envs={0: {'LOCAL_RANK': '0', 'RANK': '0', 'GROUP_RANK': '0', 'ROLE_RANK': '0', 'ROLE_NAME': 'default', 'LOCAL_WORLD_SIZE': '3', 'WORLD_SIZE': '3', 'GROUP_WORLD_SIZE': '1', 'ROLE_WORLD_SIZE': '3', 'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'TORCHELASTIC_RESTART_COUNT': '0', 'TORCHELASTIC_MAX_RESTARTS': '0', 'TORCHELASTIC_RUN_ID': 'none', 'TORCHELASTIC_USE_AGENT_STORE': 'True', 'NCCL_ASYNC_ERROR_HA...(truncated)) at ../Python/ceval.c:754
#20 _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#21 0x0000000000506d00 in fast_function.lto_priv () at ../Python/ceval.c:4992
#22 0x00000000005076ed in call_function.lto_priv () at ../Python/ceval.c:4872
#23 0x00000000005092a4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#24 0x00000000005069c8 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x5131d98, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/launcher/api.py, line 252, in launch_agent (config=<LaunchConfig(min_nodes=1, max_nodes=1, nproc_per_node=3, run_id='none', role='default', rdzv_endpoint='127.0.0.1:29500', rdzv_backend='static', rdzv_configs={'rank': 0, 'timeout': 900}, rdzv_timeout=-1, max_restarts=0, monitor_interval=5, start_method='spawn', log_dir=None, redirects=<Std(_value_=0, _name_='NONE', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7f0c81d16400>, __module__='torch.distributed.elastic.multiprocessing.api', from_str=<classmethod at remote 0x7f0bb1fc9048>, __doc__='An enumeration.', _member_names_=['NONE', 'OUT', 'ERR', 'ALL'], _member_map_={'NONE': <...>, 'OUT': <Std(_value_=1, _name_='OUT', __objclass__=<...>) at remote 0x7f0bb1fb0e48>, 'ERR': <Std(_value_=2, _name_='ERR', __objclass__=<...>) at remote 0x7f0bb1fb0e88>, 'ALL': <Std(_value_=3, _name_='ALL', __objclass__=<...>) at remote 0x7f0bb1fb0ec8>}, _member_type_=<type at remo...(truncated)) at ../Python/ceval.c:754
#25 _PyFunction_FastCall (globals=<optimized out>, nargs=85138840, args=<optimized out>, co=<optimized out>) at ../Python/ceval.c:4933
#26 fast_function.lto_priv () at ../Python/ceval.c:4968
#27 0x00000000005076ed in call_function.lto_priv () at ../Python/ceval.c:4872
#28 0x00000000005092a4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#29 0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x7f0bb1f19788, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/launcher/api.py, line 131, in __call__ (self=<elastic_launch(_config=<LaunchConfig(min_nodes=1, max_nodes=1, nproc_per_node=3, run_id='none', role='default', rdzv_endpoint='127.0.0.1:29500', rdzv_backend='static', rdzv_configs={'rank': 0, 'timeout': 900}, rdzv_timeout=-1, max_restarts=0, monitor_interval=5, start_method='spawn', log_dir=None, redirects=<Std(_value_=0, _name_='NONE', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7f0c81d16400>, __module__='torch.distributed.elastic.multiprocessing.api', from_str=<classmethod at remote 0x7f0bb1fc9048>, __doc__='An enumeration.', _member_names_=['NONE', 'OUT', 'ERR', 'ALL'], _member_map_={'NONE': <...>, 'OUT': <Std(_value_=1, _name_='OUT', __objclass__=<...>) at remote 0x7f0bb1fb0e48>, 'ERR': <Std(_value_=2, _name_='ERR', __objclass__=<...>) at remote 0x7f0bb1fb0e88>, 'ALL': <Std(_value_=3, _name_='ALL', __objclass__=<...>) at remote 0x7f0bb1fb0ec8>}, _mem...(truncated)) at ../Python/ceval.c:754
#30 _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#31 0x00000000005062b2 in _PyFunction_FastCallDict () at ../Python/ceval.c:5075
#32 0x0000000000592461 in _PyObject_FastCallDict (kwargs=0x0, nargs=3, args=0x7ffd3112f6f0, func=<function at remote 0x7f0bb1ee2158>) at ../Objects/abstract.c:2310
#33 _PyObject_Call_Prepend (kwargs=0x0, args=<optimized out>, obj=<optimized out>, func=<function at remote 0x7f0bb1ee2158>) at ../Objects/abstract.c:2373
#34 method_call.lto_priv () at ../Objects/classobject.c:314
#35 0x00000000005479ef in PyObject_Call (kwargs=0x0, args=('-u', 'prostateai/gleasonai/model_2/main.py'), func=<method at remote 0x7f0c80996388>) at ../Objects/abstract.c:2261
#36 slot_tp_call () at ../Objects/typeobject.c:6207
#37 0x000000000059d65e in PyObject_Call () at ../Objects/abstract.c:2261
#38 0x000000000050a752 in do_call_core (kwdict=0x0, callargs=('-u', 'prostateai/gleasonai/model_2/main.py'), 
    func=<elastic_launch(_config=<LaunchConfig(min_nodes=1, max_nodes=1, nproc_per_node=3, run_id='none', role='default', rdzv_endpoint='127.0.0.1:29500', rdzv_backend='static', rdzv_configs={'rank': 0, 'timeout': 900}, rdzv_timeout=-1, max_restarts=0, monitor_interval=5, start_method='spawn', log_dir=None, redirects=<Std(_value_=0, _name_='NONE', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7f0c81d16400>, __module__='torch.distributed.elastic.multiprocessing.api', from_str=<classmethod at remote 0x7f0bb1fc9048>, __doc__='An enumeration.', _member_names_=['NONE', 'OUT', 'ERR', 'ALL'], _member_map_={'NONE': <...>, 'OUT': <Std(_value_=1, _name_='OUT', __objclass__=<...>) at remote 0x7f0bb1fb0e48>, 'ERR': <Std(_value_=2, _name_='ERR', __objclass__=<...>) at remote 0x7f0bb1fb0e88>, 'ALL': <Std(_value_=3, _name_='ALL', __objclass__=<...>) at remote 0x7f0bb1fb0ec8>}, _member_type_=<type at remote 0x9cd180>, _value2member_map_={0: <...>, 1: <...>, 2: <...>, 3: <...>}, NONE=<...>, OUT=<...>, ERR=<...>, A...(truncated)) at ../Python/ceval.c:5120
#39 _PyEval_EvalFrameDefault () at ../Python/ceval.c:3404
#40 0x00000000005069c8 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x7f0bb1fe4828, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/run.py, line 713, in run (args=<Namespace(nnodes='1', nproc_per_node='3', rdzv_backend='static', rdzv_endpoint='', rdzv_id='none', rdzv_conf='', standalone=False, max_restarts=0, monitor_interval=5, start_method='spawn', role='default', module=False, no_python=False, run_path=False, log_dir=None, redirects='0', tee='0', node_rank=0, master_addr='127.0.0.1', master_port=29500, training_script='prostateai/gleasonai/model_2/main.py', training_script_args=[]) at remote 0x7f0bb1eec128>, config=<LaunchConfig(min_nodes=1, max_nodes=1, nproc_per_node=3, run_id='none', role='default', rdzv_endpoint='127.0.0.1:29500', rdzv_backend='static', rdzv_configs={'rank': 0, 'timeout': 900}, rdzv_timeout=-1, max_restarts=0, monitor_interval=5, start_method='spawn', log_dir=None, redirects=<Std(_value_=0, _name_='NONE', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7f0c81d16400>, __module__='torch.distributed.elastic.multi...(truncated)) at ../Python/ceval.c:754
#41 _PyFunction_FastCall (globals=<optimized out>, nargs=139688207599656, args=<optimized out>, co=<optimized out>) at ../Python/ceval.c:4933
#42 fast_function.lto_priv () at ../Python/ceval.c:4968
#43 0x00000000005076ed in call_function.lto_priv () at ../Python/ceval.c:4872
#44 0x00000000005092a4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#45 0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x7f0b46273708, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/run.py, line 719, in main (args=<Namespace(nnodes='1', nproc_per_node='3', rdzv_backend='static', rdzv_endpoint='', rdzv_id='none', rdzv_conf='', standalone=False, max_restarts=0, monitor_interval=5, start_method='spawn', role='default', module=False, no_python=False, run_path=False, log_dir=None, redirects='0', tee='0', node_rank=0, master_addr='127.0.0.1', master_port=29500, training_script='prostateai/gleasonai/model_2/main.py', training_script_args=[]) at remote 0x7f0bb1eec128>)) at ../Python/ceval.c:754
#46 _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#47 0x0000000000586641 in PyEval_EvalCodeEx (closure=<optimized out>, kwdefs=<optimized out>, defcount=1, defs=0x7f0c808abbf8, kwcount=0, kws=0x7f0c8313a060, argcount=<optimized out>, args=0x7f0c8313a060, locals=0x0, globals=<optimized out>, _co=<optimized out>) at ../Python/ceval.c:4187
#48 function_call.lto_priv () at ../Objects/funcobject.c:604
#49 0x000000000059d65e in PyObject_Call () at ../Objects/abstract.c:2261
#50 0x000000000050a752 in do_call_core (kwdict={}, callargs=(), func=<function at remote 0x7f0bb1ee2840>) at ../Python/ceval.c:5120
#51 _PyEval_EvalFrameDefault () at ../Python/ceval.c:3404
#52 0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x13a1c18, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py, line 345, in wrapper (args=(), kwargs={})) at ../Python/ceval.c:754
#53 _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#54 0x0000000000506d00 in fast_function.lto_priv () at ../Python/ceval.c:4992
#55 0x00000000005076ed in call_function.lto_priv () at ../Python/ceval.c:4872
#56 0x00000000005092a4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#57 0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x11ef448, for file /usr/local/bin/torchrun, line 33, in <module> ()) at ../Python/ceval.c:754
#58 _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#59 0x0000000000508103 in PyEval_EvalCodeEx (closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwcount=0, kws=0x0, argcount=0, args=0x0, locals=<optimized out>, globals=<optimized out>, _co=<optimized out>) at ../Python/ceval.c:4187
#60 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at ../Python/ceval.c:731
#61 0x0000000000634c32 in run_mod () at ../Python/pythonrun.c:1025
#62 0x0000000000634ce7 in PyRun_FileExFlags () at ../Python/pythonrun.c:978
#63 0x000000000063849f in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:419
#64 0x0000000000638675 in PyRun_AnyFileExFlags () at ../Python/pythonrun.c:81
#65 0x0000000000639041 in run_file (p_cf=0x7ffd3113036c, filename=<optimized out>, fp=<optimized out>) at ../Modules/main.c:340
#66 Py_Main () at ../Modules/main.c:810
#67 0x00000000004ad1f0 in main (argc=5, argv=0x7ffd31130568) at ../Programs/python.c:69

ptrblck · March 1, 2023, 7:43pm

Your backtrace points to a sleep operation and doesn’t show the segfault:

#0  0x00007f0c82e32d1f in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0, timeout=0x7ffd3112e850) at ../sysdeps/unix/sysv/linux/select.c:41
#1  0x00000000005b97c4 in pysleep (secs=<optimized out>) at ../Modules/timemodule.c:1467
#2  time_sleep () at ../Modules/timemodule.c:235
#3  0x00000000005075bc in _PyCFunction_FastCallDict (kwargs=<optimized out>, nargs=<optimized out>, args=0x513c850, func_obj=<built-in method sleep of module object at remote 0x7f0c81a66ef8>) at ../Objects/methodobject.c:209
#4  _PyCFunction_FastCallKeywords (kwnames=<optimized out>, nargs=<optimized out>, stack=<optimized out>, func=<optimized out>) at ../Objects/methodobject.c:294
#5  call_function.lto_priv () at ../Python/ceval.c:4851
#6  0x00000000005092a4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#7  0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0,

Did you add the sleep or did you just break at a random op?
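The sleep frame appears to come from the launcher's monitoring loop (`_invoke_run` in the elastic agent), so this backtrace describes the torchrun parent process rather than the DataLoader worker that received the signal. As a possible complement to gdb, here is a minimal sketch that uses Python's standard `faulthandler` module to dump the Python traceback of a crashing worker to a file. The function name `enable_worker_faulthandler` and the log file naming are assumptions for illustration; the function is meant to be passed as `worker_init_fn` to the DataLoader.

```python
import faulthandler
import os
import tempfile

# Keep the file objects referenced: faulthandler writes to the underlying
# file descriptor, which must stay open for the lifetime of the process.
_fault_logs = {}


def enable_worker_faulthandler(worker_id: int,
                               log_dir: str = tempfile.gettempdir()) -> str:
    """Enable faulthandler in the current process and return the log path.

    Hypothetical helper for illustration. On SIGSEGV (and other fatal
    signals) the Python traceback of all threads is written to the file.
    """
    path = os.path.join(
        log_dir, f"dataloader_worker_{worker_id}_pid{os.getpid()}.log")
    log_file = open(path, "w")
    _fault_logs[worker_id] = log_file
    faulthandler.enable(file=log_file, all_threads=True)
    return path


if __name__ == "__main__":
    print(enable_worker_faulthandler(0))
    # With PyTorch installed, usage would look roughly like this:
    # DataLoader(dataset, num_workers=4, worker_init_fn=enable_worker_faulthandler)
```

If a worker segfaults, the log file for that pid should show which Python line (for example inside the dataset's `__getitem__`) was executing, which narrows down where to look with gdb afterwards.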

Anita · March 2, 2023, 1:57pm

Hi @ptrblck, sometimes the processes keep running even though I get the segmentation fault, which is why I had to stop them manually for the previous output.

However, the bt output below is from a run where the segfault showed up and I did not interrupt the processes.
I would really appreciate it if you could check it out and let me know your thoughts.

(gdb) bt
#0  0x00007f4c03e84d1f in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0, timeout=0x7ffee3ed8c90) at ../sysdeps/unix/sysv/linux/select.c:41
#1  0x00000000005b97c4 in pysleep (secs=<optimized out>) at ../Modules/timemodule.c:1467
#2  time_sleep () at ../Modules/timemodule.c:235
#3  0x00000000005075bc in _PyCFunction_FastCallDict (kwargs=<optimized out>, nargs=<optimized out>, args=0x4c63070, func_obj=<built-in method sleep of module object at remote 0x7f4c02ab8ef8>) at ../Objects/methodobject.c:209
#4  _PyCFunction_FastCallKeywords (kwnames=<optimized out>, nargs=<optimized out>, stack=<optimized out>, func=<optimized out>) at ../Objects/methodobject.c:294
#5  call_function.lto_priv () at ../Python/ceval.c:4851
#6  0x00000000005092a4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#7  0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x4c62ea8, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/agent/server/api.py, line 843, in _invoke_run (self=<LocalElasticAgent(_worker_group=<WorkerGroup at remote 0x7f4b32fd28e0>, _remaining_restarts=0, _store=<torch._C._distributed_c10d.PrefixStore at remote 0x7f4b32f82688>, _exit_barrier_timeout=300, _total_execution_time=0, _start_method='spawn', _pcontext=<SubprocessContext(name='default', entrypoint='/usr/bin/python3', args={0: ('-u', 'prostateai/gleasonai/model_2/main.py'), 1: ('-u', 'prostateai/gleasonai/model_2/main.py')}, envs={0: {'LOCAL_RANK': '0', 'RANK': '0', 'GROUP_RANK': '0', 'ROLE_RANK': '0', 'ROLE_NAME': 'default', 'LOCAL_WORLD_SIZE': '2', 'WORLD_SIZE': '2', 'GROUP_WORLD_SIZE': '1', 'ROLE_WORLD_SIZE': '2', 'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'TORCHELASTIC_RESTART_COUNT': '0', 'TORCHELASTIC_MAX_RESTARTS': '0', 'TORCHELASTIC_RUN_ID': 'none', 'TORCHELASTIC_USE_AGENT_STORE': 'True', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'OMP_NUM_THREADS': '1', 'TORC...(truncated)) at ../Python/ceval.c:754
#8  _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#9  0x0000000000506d00 in fast_function.lto_priv () at ../Python/ceval.c:4992
#10 0x00000000005076ed in call_function.lto_priv () at ../Python/ceval.c:4872
#11 0x00000000005092a4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#12 0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x4c3d568, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/agent/server/api.py, line 709, in run (self=<LocalElasticAgent(_worker_group=<WorkerGroup at remote 0x7f4b32fd28e0>, _remaining_restarts=0, _store=<torch._C._distributed_c10d.PrefixStore at remote 0x7f4b32f82688>, _exit_barrier_timeout=300, _total_execution_time=0, _start_method='spawn', _pcontext=<SubprocessContext(name='default', entrypoint='/usr/bin/python3', args={0: ('-u', 'prostateai/gleasonai/model_2/main.py'), 1: ('-u', 'prostateai/gleasonai/model_2/main.py')}, envs={0: {'LOCAL_RANK': '0', 'RANK': '0', 'GROUP_RANK': '0', 'ROLE_RANK': '0', 'ROLE_NAME': 'default', 'LOCAL_WORLD_SIZE': '2', 'WORLD_SIZE': '2', 'GROUP_WORLD_SIZE': '1', 'ROLE_WORLD_SIZE': '2', 'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'TORCHELASTIC_RESTART_COUNT': '0', 'TORCHELASTIC_MAX_RESTARTS': '0', 'TORCHELASTIC_RUN_ID': 'none', 'TORCHELASTIC_USE_AGENT_STORE': 'True', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'OMP_NUM_THREADS': '1', 'TORCHELASTIC...(truncated)) at ../Python/ceval.c:754
#13 _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#14 0x0000000000586641 in PyEval_EvalCodeEx (closure=<optimized out>, kwdefs=<optimized out>, defcount=1, defs=0x7f4b32fc5958, kwcount=0, kws=0x7f4c0418c060, argcount=<optimized out>, args=0x7f4ac72c1760, locals=0x0, globals=<optimized out>, _co=<optimized out>) at ../Python/ceval.c:4187
#15 function_call.lto_priv () at ../Objects/funcobject.c:604
#16 0x000000000059d65e in PyObject_Call () at ../Objects/abstract.c:2261
#17 0x000000000050a752 in do_call_core (kwdict={}, 
    callargs=(<LocalElasticAgent(_worker_group=<WorkerGroup at remote 0x7f4b32fd28e0>, _remaining_restarts=0, _store=<torch._C._distributed_c10d.PrefixStore at remote 0x7f4b32f82688>, _exit_barrier_timeout=300, _total_execution_time=0, _start_method='spawn', _pcontext=<SubprocessContext(name='default', entrypoint='/usr/bin/python3', args={0: ('-u', 'prostateai/gleasonai/model_2/main.py'), 1: ('-u', 'prostateai/gleasonai/model_2/main.py')}, envs={0: {'LOCAL_RANK': '0', 'RANK': '0', 'GROUP_RANK': '0', 'ROLE_RANK': '0', 'ROLE_NAME': 'default', 'LOCAL_WORLD_SIZE': '2', 'WORLD_SIZE': '2', 'GROUP_WORLD_SIZE': '1', 'ROLE_WORLD_SIZE': '2', 'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'TORCHELASTIC_RESTART_COUNT': '0', 'TORCHELASTIC_MAX_RESTARTS': '0', 'TORCHELASTIC_RUN_ID': 'none', 'TORCHELASTIC_USE_AGENT_STORE': 'True', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'OMP_NUM_THREADS': '1', 'TORCHELASTIC_ERROR_FILE': '/tmp/torchelastic_sd2v1fbe/none_k3qyzzyt/attempt_0/0/error.json'}, 1: {'LOCAL_RANK': '1', 'RANK': '1', 'GROUP_RANK': '0...(truncated), func=<function at remote 0x7f4b32f69d08>) at ../Python/ceval.c:5120
#18 _PyEval_EvalFrameDefault () at ../Python/ceval.c:3404
#19 0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x4bcd358, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/metrics/api.py, line 125, in wrapper (args=(<LocalElasticAgent(_worker_group=<WorkerGroup at remote 0x7f4b32fd28e0>, _remaining_restarts=0, _store=<torch._C._distributed_c10d.PrefixStore at remote 0x7f4b32f82688>, _exit_barrier_timeout=300, _total_execution_time=0, _start_method='spawn', _pcontext=<SubprocessContext(name='default', entrypoint='/usr/bin/python3', args={0: ('-u', 'prostateai/gleasonai/model_2/main.py'), 1: ('-u', 'prostateai/gleasonai/model_2/main.py')}, envs={0: {'LOCAL_RANK': '0', 'RANK': '0', 'GROUP_RANK': '0', 'ROLE_RANK': '0', 'ROLE_NAME': 'default', 'LOCAL_WORLD_SIZE': '2', 'WORLD_SIZE': '2', 'GROUP_WORLD_SIZE': '1', 'ROLE_WORLD_SIZE': '2', 'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'TORCHELASTIC_RESTART_COUNT': '0', 'TORCHELASTIC_MAX_RESTARTS': '0', 'TORCHELASTIC_RUN_ID': 'none', 'TORCHELASTIC_USE_AGENT_STORE': 'True', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'OMP_NUM_THREADS': '1', 'TORCHELASTIC...(truncated)) at ../Python/ceval.c:754
#20 _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#21 0x0000000000506d00 in fast_function.lto_priv () at ../Python/ceval.c:4992
#22 0x00000000005076ed in call_function.lto_priv () at ../Python/ceval.c:4872
#23 0x00000000005092a4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#24 0x00000000005069c8 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x4c49e28, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/launcher/api.py, line 252, in launch_agent (config=<LaunchConfig(min_nodes=1, max_nodes=1, nproc_per_node=2, run_id='none', role='default', rdzv_endpoint='127.0.0.1:29500', rdzv_backend='static', rdzv_configs={'rank': 0, 'timeout': 900}, rdzv_timeout=-1, max_restarts=0, monitor_interval=5, start_method='spawn', log_dir=None, redirects=<Std(_value_=0, _name_='NONE', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7f4c02d67488>, __module__='torch.distributed.elastic.multiprocessing.api', from_str=<classmethod at remote 0x7f4b33051eb8>, __doc__='An enumeration.', _member_names_=['NONE', 'OUT', 'ERR', 'ALL'], _member_map_={'NONE': <...>, 'OUT': <Std(_value_=1, _name_='OUT', __objclass__=<...>) at remote 0x7f4b32fd0bc8>, 'ERR': <Std(_value_=2, _name_='ERR', __objclass__=<...>) at remote 0x7f4b32fd0c08>, 'ALL': <Std(_value_=3, _name_='ALL', __objclass__=<...>) at remote 0x7f4b32fd0c48>}, _member_type_=<type at remo...(truncated)) at ../Python/ceval.c:754
#25 _PyFunction_FastCall (globals=<optimized out>, nargs=79994408, args=<optimized out>, co=<optimized out>) at ../Python/ceval.c:4933
#26 fast_function.lto_priv () at ../Python/ceval.c:4968
#27 0x00000000005076ed in call_function.lto_priv () at ../Python/ceval.c:4872
#28 0x00000000005092a4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#29 0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x7f4b32fe15b8, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/launcher/api.py, line 131, in __call__ (self=<elastic_launch(_config=<LaunchConfig(min_nodes=1, max_nodes=1, nproc_per_node=2, run_id='none', role='default', rdzv_endpoint='127.0.0.1:29500', rdzv_backend='static', rdzv_configs={'rank': 0, 'timeout': 900}, rdzv_timeout=-1, max_restarts=0, monitor_interval=5, start_method='spawn', log_dir=None, redirects=<Std(_value_=0, _name_='NONE', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7f4c02d67488>, __module__='torch.distributed.elastic.multiprocessing.api', from_str=<classmethod at remote 0x7f4b33051eb8>, __doc__='An enumeration.', _member_names_=['NONE', 'OUT', 'ERR', 'ALL'], _member_map_={'NONE': <...>, 'OUT': <Std(_value_=1, _name_='OUT', __objclass__=<...>) at remote 0x7f4b32fd0bc8>, 'ERR': <Std(_value_=2, _name_='ERR', __objclass__=<...>) at remote 0x7f4b32fd0c08>, 'ALL': <Std(_value_=3, _name_='ALL', __objclass__=<...>) at remote 0x7f4b32fd0c48>}, _mem...(truncated)) at ../Python/ceval.c:754
#30 _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#31 0x00000000005062b2 in _PyFunction_FastCallDict () at ../Python/ceval.c:5075
#32 0x0000000000592461 in _PyObject_FastCallDict (kwargs=0x0, nargs=3, args=0x7ffee3ed9b30, func=<function at remote 0x7f4b32f7a0d0>) at ../Objects/abstract.c:2310
#33 _PyObject_Call_Prepend (kwargs=0x0, args=<optimized out>, obj=<optimized out>, func=<function at remote 0x7f4b32f7a0d0>) at ../Objects/abstract.c:2373
#34 method_call.lto_priv () at ../Objects/classobject.c:314
#35 0x00000000005479ef in PyObject_Call (kwargs=0x0, args=('-u', 'prostateai/gleasonai/model_2/main.py'), func=<method at remote 0x7f4c019d12c8>) at ../Objects/abstract.c:2261
#36 slot_tp_call () at ../Objects/typeobject.c:6207
#37 0x000000000059d65e in PyObject_Call () at ../Objects/abstract.c:2261
#38 0x000000000050a752 in do_call_core (kwdict=0x0, callargs=('-u', 'prostateai/gleasonai/model_2/main.py'), 
    func=<elastic_launch(_config=<LaunchConfig(min_nodes=1, max_nodes=1, nproc_per_node=2, run_id='none', role='default', rdzv_endpoint='127.0.0.1:29500', rdzv_backend='static', rdzv_configs={'rank': 0, 'timeout': 900}, rdzv_timeout=-1, max_restarts=0, monitor_interval=5, start_method='spawn', log_dir=None, redirects=<Std(_value_=0, _name_='NONE', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7f4c02d67488>, __module__='torch.distributed.elastic.multiprocessing.api', from_str=<classmethod at remote 0x7f4b33051eb8>, __doc__='An enumeration.', _member_names_=['NONE', 'OUT', 'ERR', 'ALL'], _member_map_={'NONE': <...>, 'OUT': <Std(_value_=1, _name_='OUT', __objclass__=<...>) at remote 0x7f4b32fd0bc8>, 'ERR': <Std(_value_=2, _name_='ERR', __objclass__=<...>) at remote 0x7f4b32fd0c08>, 'ALL': <Std(_value_=3, _name_='ALL', __objclass__=<...>) at remote 0x7f4b32fd0c48>}, _member_type_=<type at remote 0x9cd180>, _value2member_map_={0: <...>, 1: <...>, 2: <...>, 3: <...>}, NONE=<...>, OUT=<...>, ERR=<...>, A...(truncated)) at ../Python/ceval.c:5120
#39 _PyEval_EvalFrameDefault () at ../Python/ceval.c:3404
#40 0x00000000005069c8 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x7f4b32f7f048, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/run.py, line 713, in run (args=<Namespace(nnodes='1', nproc_per_node='2', rdzv_backend='static', rdzv_endpoint='', rdzv_id='none', rdzv_conf='', standalone=False, max_restarts=0, monitor_interval=5, start_method='spawn', role='default', module=False, no_python=False, run_path=False, log_dir=None, redirects='0', tee='0', node_rank=0, master_addr='127.0.0.1', master_port=29500, training_script='prostateai/gleasonai/model_2/main.py', training_script_args=[]) at remote 0x7f4b32f82470>, config=<LaunchConfig(min_nodes=1, max_nodes=1, nproc_per_node=2, run_id='none', role='default', rdzv_endpoint='127.0.0.1:29500', rdzv_backend='static', rdzv_configs={'rank': 0, 'timeout': 900}, rdzv_timeout=-1, max_restarts=0, monitor_interval=5, start_method='spawn', log_dir=None, redirects=<Std(_value_=0, _name_='NONE', __objclass__=<EnumMeta(_generate_next_value_=<function at remote 0x7f4c02d67488>, __module__='torch.distributed.elastic.multi...(truncated)) at ../Python/ceval.c:754
#41 _PyFunction_FastCall (globals=<optimized out>, nargs=139960954384456, args=<optimized out>, co=<optimized out>) at ../Python/ceval.c:4933
#42 fast_function.lto_priv () at ../Python/ceval.c:4968
#43 0x00000000005076ed in call_function.lto_priv () at ../Python/ceval.c:4872
#44 0x00000000005092a4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#45 0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x7f4b3301ea48, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/run.py, line 719, in main (args=<Namespace(nnodes='1', nproc_per_node='2', rdzv_backend='static', rdzv_endpoint='', rdzv_id='none', rdzv_conf='', standalone=False, max_restarts=0, monitor_interval=5, start_method='spawn', role='default', module=False, no_python=False, run_path=False, log_dir=None, redirects='0', tee='0', node_rank=0, master_addr='127.0.0.1', master_port=29500, training_script='prostateai/gleasonai/model_2/main.py', training_script_args=[]) at remote 0x7f4b32f82470>)) at ../Python/ceval.c:754
#46 _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#47 0x0000000000586641 in PyEval_EvalCodeEx (closure=<optimized out>, kwdefs=<optimized out>, defcount=1, defs=0x7f4c018fdae0, kwcount=0, kws=0x7f4c0418c060, argcount=<optimized out>, args=0x7f4c0418c060, locals=0x0, globals=<optimized out>, _co=<optimized out>) at ../Python/ceval.c:4187
#48 function_call.lto_priv () at ../Objects/funcobject.c:604
#49 0x000000000059d65e in PyObject_Call () at ../Objects/abstract.c:2261
#50 0x000000000050a752 in do_call_core (kwdict={}, callargs=(), func=<function at remote 0x7f4b32f7a7b8>) at ../Python/ceval.c:5120
#51 _PyEval_EvalFrameDefault () at ../Python/ceval.c:3404
#52 0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x4710f18, for file /usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py, line 345, in wrapper (args=(), kwargs={})) at ../Python/ceval.c:754
#53 _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#54 0x0000000000506d00 in fast_function.lto_priv () at ../Python/ceval.c:4992
#55 0x00000000005076ed in call_function.lto_priv () at ../Python/ceval.c:4872
#56 0x00000000005092a4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#57 0x0000000000504fd4 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0xd7bcc8, for file /usr/local/bin/torchrun, line 33, in <module> ()) at ../Python/ceval.c:754
#58 _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#59 0x0000000000508103 in PyEval_EvalCodeEx (closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwcount=0, kws=0x0, argcount=0, args=0x0, locals=<optimized out>, globals=<optimized out>, _co=<optimized out>) at ../Python/ceval.c:4187
#60 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at ../Python/ceval.c:731
#61 0x0000000000634c32 in run_mod () at ../Python/pythonrun.c:1025
#62 0x0000000000634ce7 in PyRun_FileExFlags () at ../Python/pythonrun.c:978
#63 0x000000000063849f in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:419
#64 0x0000000000638675 in PyRun_AnyFileExFlags () at ../Python/pythonrun.c:81
#65 0x0000000000639041 in run_file (p_cf=0x7ffee3eda7ac, filename=<optimized out>, fp=<optimized out>) at ../Modules/main.c:340
#66 Py_Main () at ../Modules/main.c:810
#67 0x00000000004ad1f0 in main (argc=5, argv=0x7ffee3eda9a8) at ../Programs/python.c:69

Anita · March 2, 2023, 2:01pm

@ptrblck Also, when I check the output with py-bt, it shows:


(gdb) 
(gdb) py-bt
Traceback (most recent call first):
  <built-in method sleep of module object at remote 0x7f4c02ab8ef8>
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
    time.sleep(monitor_interval)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.2+cu113', 'console_scripts', 'torchrun')())

Anita · March 2, 2023, 2:26pm

@ptrblck Here are some other outputs as well, in case they are helpful. Thank you in advance.


(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f4c04378740 (LWP 7371) "torchrun" 0x00000000005092a4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
  2    Thread 0x7f4b0026e700 (LWP 7414) "torchrun" 0x00007f4c03e82bb9 in __GI___poll (fds=0x7f4b04001030, nfds=6, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  3    Thread 0x7f4b00a6f700 (LWP 7415) "torchrun" 0x00007f4c03e82bb9 in __GI___poll (fds=0x7f4ac0000b40, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29


(gdb) t a a py-bt

Thread 3 (Thread 0x7f4b00a6f700 (LWP 7415)):
Unable to locate python frame

Thread 2 (Thread 0x7f4b0026e700 (LWP 7414)):
Unable to locate python frame

Thread 1 (Thread 0x7f4c04378740 (LWP 7371)):
Traceback (most recent call first):
  <built-in method sleep of module object at remote 0x7f4c02ab8ef8>
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
    time.sleep(monitor_interval)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.2+cu113', 'console_scripts', 'torchrun')())


(gdb) py-list
  28    globals().setdefault('load_entry_point', importlib_load_entry_point)
  29    
  30    
  31    if __name__ == '__main__':
  32        sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
 >33        sys.exit(load_entry_point('torch==1.10.2+cu113', 'console_scripts', 'torchrun')())

ptrblck · March 2, 2023, 8:12pm

All stacktraces just point to the sleep calls and do not show the segfault.

Do you still see the DataLoader error claiming a worker was killed, or is the entire script now running fine?