Segmentation fault

@ptrblck this is a slightly different issue than in our previous conversations. I am running a simple training script with DDP. I tried what you suggested, but gdb gives me nothing to work with: "No stack." is literally all it says (full output below). I am a bit lost.
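For context, here is a hypothetical, stripped-down version of the entry point (the names `train`, `opts`, and `main_distributed` match the traceback below; everything else is a stand-in for my real script):

    import argparse
    import torch.multiprocessing as mp

    def train(rank, opts):
        # mp.spawn calls this as train(rank, *args), so the worker rank comes first
        print(f'worker {rank} of {opts.world_size} started')

    def main_distributed():
        opts = argparse.Namespace(world_size=2)  # stand-in for my real opts object
        spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)

    if __name__ == '__main__':
        main_distributed()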

----> about to cleanup worker with rank 0
clean up done successfully! 0
Traceback (most recent call last):
  File "ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 338, in <module>
    main_distributed()
  File "ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 230, in main_distributed
    spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 105, in join
    raise Exception(
Exception: process 0 terminated with signal SIGSEGV
[Thread 0x2aaaaaaf0ec0 (LWP 25285) exited]
[Inferior 1 (process 25285) exited with code 01]
Missing separate debuginfos, use: debuginfo-install glibc-2.17-307.el7.1.x86_64 nvidia-driver-latest-cuda-libs-450.36.06-1.el7.x86_64
(gdb) backtrace
No stack.

Related: "How to fix a SIGSEGV in pytorch when using distributed training (e.g. DDP)?", which has more details about what I've already tried.

PS: I also noticed this happens when rank 0 ends before rank 1 (that is what helped me reproduce it; otherwise a SIGABRT happens instead). The cleanup code below forces that ordering by making rank 1 sleep:

    # clean up distributed code
    torch.distributed.barrier()
    if rank == 1:
        time.sleep(1)  # rank 1 lags behind, so rank 0 tears down first
    print(f'\n----> about to cleanup worker with rank {rank}')
    # cleanup(rank)
    torch.distributed.destroy_process_group()
    print(f'clean up done successfully! {rank}')
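For completeness, here is a self-contained sketch of what a minimal reproducer could look like (hypothetical and illustrative only; the backend, address, and port are assumptions, not my real config). Rank 0 reaches `destroy_process_group()` while rank 1 is still sleeping:

    import os
    import time

    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '29500'
        dist.init_process_group('gloo', rank=rank, world_size=world_size)

        # ... training would normally go here ...

        # same cleanup pattern as above: barrier, then let rank 1 lag behind
        dist.barrier()
        if rank == 1:
            time.sleep(1)
        print(f'\n----> about to cleanup worker with rank {rank}')
        dist.destroy_process_group()
        print(f'clean up done successfully! {rank}')

    if __name__ == '__main__':
        world_size = 2
        mp.spawn(fn=worker, args=(world_size,), nprocs=world_size)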