@ptrblck this is a slightly different issue from our previous conversations. I am running a simple training script with DDP. I tried what you suggested, but the gdb output is essentially empty; "No stack." is literally all gdb says. I am a bit lost.
----> about to cleanup worker with rank 0
clean up done successfully! 0
Traceback (most recent call last):
  File "ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 338, in <module>
    main_distributed()
  File "ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 230, in main_distributed
    spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 105, in join
    raise Exception(
Exception: process 0 terminated with signal SIGSEGV
[Thread 0x2aaaaaaf0ec0 (LWP 25285) exited]
[Inferior 1 (process 25285) exited with code 01]
Missing separate debuginfos, use: debuginfo-install glibc-2.17-307.el7.1.x86_64 nvidia-driver-latest-cuda-libs-450.36.06-1.el7.x86_64
(gdb) backtrace
No stack.
Related: How to fix a SIGSEGV in pytorch when using distributed training (e.g. DDP)? That question has more details on what I've already tried.
PS: I also noticed this only happens when rank 0 finishes before rank 1 (that is how I reproduce it; otherwise a SIGABRT happens instead).
# clean up distributed code
torch.distributed.barrier()
if rank == 1:
    # delay rank 1 so that rank 0 reaches destroy_process_group() and returns first
    time.sleep(1)
print(f'\n----> about to cleanup worker with rank {rank}')
# cleanup(rank)
torch.distributed.destroy_process_group()
print(f'clean up done successfully! {rank}')
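In case a runnable toy helps, here is a minimal, self-contained sketch of the spawn/cleanup pattern above (not my actual training script). The worker name demo_worker, the gloo backend, and the MASTER_ADDR/MASTER_PORT settings are assumptions I made just to keep the example self-contained, and I can't promise this exact toy triggers the segfault outside my environment:

import os
import time

import torch.distributed as dist
import torch.multiprocessing as mp


def demo_worker(rank: int, world_size: int):
    # hypothetical minimal worker: init, barrier, then tear down with rank 0 exiting first
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group(backend='gloo', rank=rank, world_size=world_size)

    dist.barrier()
    if rank == 1:
        # delay rank 1 so rank 0 tears down and returns first
        time.sleep(1)
    print(f'\n----> about to cleanup worker with rank {rank}')
    dist.destroy_process_group()
    print(f'clean up done successfully! {rank}')


if __name__ == '__main__':
    world_size = 2
    mp.spawn(fn=demo_worker, args=(world_size,), nprocs=world_size)

The time.sleep(1) on rank 1 is only there to force rank 0 to reach destroy_process_group() and return first, which is the ordering that triggers the SIGSEGV for me.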