Stack smashing error in multi-node distributed training on H100s

Hi, I’m trying to evaluate distributed training on H100 instances using the minimal torchrun example. Training works fine in single-machine, multi-GPU mode, but I’m hitting a stack smashing error during multi-node training.
PyTorch version

2.1.0a0+b5021ba

Command to launch

torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 elastic_ddp.py

Stack trace

torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 elastic_ddp.py
[2023-08-27 15:43:33,056] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2023-08-27 15:43:33,056] torch.distributed.run: [WARNING]
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
eth0
Start running basic DDP example on rank 0.
Start running basic DDP example on rank 3.
Start running basic DDP example on rank 6.
Start running basic DDP example on rank 4.
Start running basic DDP example on rank 1.
Start running basic DDP example on rank 2.
Start running basic DDP example on rank 5.
Start running basic DDP example on rank 7.
ip-10-43-1-202:26211:26211 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
ip-10-43-1-202:26211:26211 [0] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26211:26211 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26211:26211 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26211:26211 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.18.3+cuda12.1
ip-10-43-1-202:26212:26212 [1] NCCL INFO cudaDriverVersion 12000
ip-10-43-1-202:26212:26212 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
ip-10-43-1-202:26212:26212 [1] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26212:26212 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26212:26212 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26217:26217 [6] NCCL INFO cudaDriverVersion 12000
ip-10-43-1-202:26217:26217 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
ip-10-43-1-202:26217:26217 [6] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26217:26217 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26217:26217 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26213:26213 [2] NCCL INFO cudaDriverVersion 12000
ip-10-43-1-202:26213:26213 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
ip-10-43-1-202:26213:26213 [2] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26213:26213 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26213:26213 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26218:26218 [7] NCCL INFO cudaDriverVersion 12000
ip-10-43-1-202:26218:26218 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
ip-10-43-1-202:26214:26214 [3] NCCL INFO cudaDriverVersion 12000
ip-10-43-1-202:26214:26214 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
ip-10-43-1-202:26218:26218 [7] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26218:26218 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26218:26218 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26214:26214 [3] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26214:26214 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26214:26214 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26215:26215 [4] NCCL INFO cudaDriverVersion 12000
ip-10-43-1-202:26215:26215 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
ip-10-43-1-202:26215:26215 [4] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26215:26215 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26215:26215 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26216:26216 [5] NCCL INFO cudaDriverVersion 12000
ip-10-43-1-202:26216:26216 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
ip-10-43-1-202:26216:26216 [5] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26216:26216 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26216:26216 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26212:26350 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26212:26350 [1] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26212:26350 [1] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26212:26350 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26212:26350 [1] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26212:26350 [1] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26211:26348 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26211:26348 [0] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26211:26348 [0] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26211:26348 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26211:26348 [0] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26211:26348 [0] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26215:26354 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26215:26354 [4] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26215:26354 [4] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26215:26354 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26215:26354 [4] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26215:26354 [4] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26217:26349 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26217:26349 [6] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26217:26349 [6] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26217:26349 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26217:26349 [6] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26217:26349 [6] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26213:26351 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26213:26351 [2] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26213:26351 [2] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26213:26351 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26214:26353 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26214:26353 [3] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26214:26353 [3] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26214:26353 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26213:26351 [2] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26213:26351 [2] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26214:26353 [3] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26214:26353 [3] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26216:26355 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26216:26355 [5] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26216:26355 [5] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26216:26355 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26216:26355 [5] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26216:26355 [5] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26218:26352 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26218:26352 [7] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26218:26352 [7] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26218:26352 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26218:26352 [7] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26218:26352 [7] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26216:26355 [5] NCCL INFO comm 0x55a52bdd6b70 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId a8000 commId 0xa71ae6e97159067a - Init START
ip-10-43-1-202:26213:26351 [2] NCCL INFO comm 0x557382127550 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 75000 commId 0xa71ae6e97159067a - Init START
ip-10-43-1-202:26214:26353 [3] NCCL INFO comm 0x5596720e2390 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 86000 commId 0xa71ae6e97159067a - Init START
ip-10-43-1-202:26218:26352 [7] NCCL INFO comm 0x55c73a7cd790 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId ca000 commId 0xa71ae6e97159067a - Init START
ip-10-43-1-202:26212:26350 [1] NCCL INFO comm 0x56270d7ef760 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId 64000 commId 0xa71ae6e97159067a - Init START
ip-10-43-1-202:26211:26348 [0] NCCL INFO comm 0x55c6dceb7ca0 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 53000 commId 0xa71ae6e97159067a - Init START
ip-10-43-1-202:26215:26354 [4] NCCL INFO comm 0x563c3517b750 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 97000 commId 0xa71ae6e97159067a - Init START
ip-10-43-1-202:26217:26349 [6] NCCL INFO comm 0x563fb0d40c70 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId b9000 commId 0xa71ae6e97159067a - Init START
*** stack smashing detected ***: terminated
*** stack smashing detected ***: terminated
*** stack smashing detected ***: terminated
*** stack smashing detected ***: terminated
*** stack smashing detected ***: terminated
*** stack smashing detected ***: terminated
*** stack smashing detected ***: terminated
*** stack smashing detected ***: terminated
[2023-08-27 15:44:39,390] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 26212 closing signal SIGTERM
[2023-08-27 15:44:39,390] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 26214 closing signal SIGTERM
[2023-08-27 15:44:39,390] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 26217 closing signal SIGTERM
[2023-08-27 15:44:39,390] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 26218 closing signal SIGTERM
[2023-08-27 15:44:41,938] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 26211) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+b5021ba', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
elastic_ddp.py FAILED
-------------------------------------------------------
Failures:
[1]:
  time      : 2023-08-27_15:44:39
  host      : ip-10-43-1-202.us-west-2.compute.internal
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 26213)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 26213
[2]:
  time      : 2023-08-27_15:44:39
  host      : ip-10-43-1-202.us-west-2.compute.internal
  rank      : 4 (local_rank: 4)
  exitcode  : -6 (pid: 26215)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 26215
[3]:
  time      : 2023-08-27_15:44:39
  host      : ip-10-43-1-202.us-west-2.compute.internal
  rank      : 5 (local_rank: 5)
  exitcode  : -6 (pid: 26216)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 26216
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-27_15:44:39
  host      : ip-10-43-1-202.us-west-2.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 26211)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 26211
=======================================================
marri@ip-10-43-1-202:/sensei-fs/users/marri/dist-training$ export NCCL_SOCKET_IFNAME=
marri@ip-10-43-1-202:/sensei-fs/users/marri/dist-training$ torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=10.43.1.202:29400 elastic_ddp.py
[2023-08-27 15:45:10,130] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2023-08-27 15:45:10,130] torch.distributed.run: [WARNING]
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************








Start running basic DDP example on rank 0.
Start running basic DDP example on rank 2.
Start running basic DDP example on rank 1.
Start running basic DDP example on rank 3.
Start running basic DDP example on rank 4.
Start running basic DDP example on rank 5.
Start running basic DDP example on rank 6.
Start running basic DDP example on rank 7.
ip-10-43-1-202:26431:26431 [0] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26431:26431 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26431:26431 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26431:26431 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.18.3+cuda12.1
ip-10-43-1-202:26433:26433 [2] NCCL INFO cudaDriverVersion 12000
ip-10-43-1-202:26433:26433 [2] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26433:26433 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26433:26433 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26435:26435 [4] NCCL INFO cudaDriverVersion 12000
ip-10-43-1-202:26435:26435 [4] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26435:26435 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26435:26435 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26437:26437 [6] NCCL INFO cudaDriverVersion 12000
ip-10-43-1-202:26432:26432 [1] NCCL INFO cudaDriverVersion 12000
ip-10-43-1-202:26437:26437 [6] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26437:26437 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26437:26437 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26432:26432 [1] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26432:26432 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26432:26432 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26438:26438 [7] NCCL INFO cudaDriverVersion 12000
ip-10-43-1-202:26438:26438 [7] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26438:26438 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26438:26438 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26434:26434 [3] NCCL INFO cudaDriverVersion 12000
ip-10-43-1-202:26436:26436 [5] NCCL INFO cudaDriverVersion 12000
ip-10-43-1-202:26434:26434 [3] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26434:26434 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26434:26434 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26436:26436 [5] NCCL INFO Bootstrap : Using eth0:10.43.1.202<0>
ip-10-43-1-202:26436:26436 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ip-10-43-1-202:26436:26436 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
ip-10-43-1-202:26431:26568 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26431:26568 [0] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26431:26568 [0] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26431:26568 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26431:26568 [0] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26431:26568 [0] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26437:26571 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26437:26571 [6] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26437:26571 [6] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26437:26571 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26437:26571 [6] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26437:26571 [6] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26435:26570 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26435:26570 [4] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26435:26570 [4] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26435:26570 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26435:26570 [4] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26435:26570 [4] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26433:26569 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26433:26569 [2] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26433:26569 [2] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26433:26569 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26433:26569 [2] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26433:26569 [2] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26432:26572 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26432:26572 [1] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26432:26572 [1] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26432:26572 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26432:26572 [1] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26432:26572 [1] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26438:26573 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26438:26573 [7] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26438:26573 [7] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26438:26573 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26438:26573 [7] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26438:26573 [7] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26436:26575 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26436:26575 [5] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26436:26575 [5] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26436:26575 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26436:26575 [5] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26436:26575 [5] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26434:26574 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-43-1-202:26434:26574 [3] NCCL INFO NET/OFI Configuring AWS-specific options
ip-10-43-1-202:26434:26574 [3] NCCL INFO NET/OFI Setting NCCL_PROTO to "simple"
ip-10-43-1-202:26434:26574 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-43-1-202:26434:26574 [3] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-43-1-202:26434:26574 [3] NCCL INFO Using network AWS Libfabric
ip-10-43-1-202:26431:26568 [0] NCCL INFO comm 0x56232b40dc00 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 53000 commId 0x2f13137ec4e46d95 - Init START
ip-10-43-1-202:26434:26574 [3] NCCL INFO comm 0x5602401fb490 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 86000 commId 0x2f13137ec4e46d95 - Init START
ip-10-43-1-202:26432:26572 [1] NCCL INFO comm 0x55e7ffe52540 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId 64000 commId 0x2f13137ec4e46d95 - Init START
ip-10-43-1-202:26436:26575 [5] NCCL INFO comm 0x56141cc43690 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId a8000 commId 0x2f13137ec4e46d95 - Init START
ip-10-43-1-202:26433:26569 [2] NCCL INFO comm 0x55df6deb11b0 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 75000 commId 0x2f13137ec4e46d95 - Init START
ip-10-43-1-202:26438:26573 [7] NCCL INFO comm 0x564bc6e72b50 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId ca000 commId 0x2f13137ec4e46d95 - Init START
ip-10-43-1-202:26435:26570 [4] NCCL INFO comm 0x5582175efab0 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 97000 commId 0x2f13137ec4e46d95 - Init START
ip-10-43-1-202:26437:26571 [6] NCCL INFO comm 0x5581ba7025b0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId b9000 commId 0x2f13137ec4e46d95 - Init START
*** stack smashing detected ***: terminated
*** stack smashing detected ***: terminated
*** stack smashing detected ***: terminated
*** stack smashing detected ***: terminated
*** stack smashing detected ***: terminated
*** stack smashing detected ***: terminated
*** stack smashing detected ***: terminated
*** stack smashing detected ***: terminated
[2023-08-27 15:46:11,401] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 26435 closing signal SIGTERM
[2023-08-27 15:46:12,720] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 26431) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+b5021ba', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
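
The failure report above lists `error_file: <N/A>` and only `Signal 6 (SIGABRT)`, so there is no Python-level traceback for the crash itself. One possible way to see where a worker was when it aborted is Python's standard `faulthandler` module, whose `enable()` call installs handlers for SIGABRT (among other signals) that dump the Python stack to stderr. The sketch below only demonstrates the mechanism in a child process: `init_process_group_stand_in` and the `os.abort()` call are hypothetical stand-ins for the native crash, not code from elastic_ddp.py.

```python
import subprocess
import sys
import textwrap

# Child program standing in for a worker script such as elastic_ddp.py.
# os.abort() raises SIGABRT, the same signal reported in the failure summary.
CHILD_SOURCE = textwrap.dedent(
    """
    import faulthandler
    import os
    import sys

    # Install handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL that
    # write the Python stack of every thread to stderr before the process dies.
    faulthandler.enable(file=sys.stderr, all_threads=True)

    def init_process_group_stand_in():
        os.abort()  # hypothetical stand-in for the native crash

    init_process_group_stand_in()
    """
)


def run_child() -> subprocess.CompletedProcess:
    """Run the child in a fresh interpreter and capture its stderr."""
    return subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    result = run_child()
    print("return code:", result.returncode)
    print(result.stderr)
```

In the real script the equivalent would be a single `faulthandler.enable()` call near the top of the file. It shows only the Python frames that were active at the abort; locating the native frame that overran its stack would still need a core dump or a debugger.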

Are you seeing the same error using any of the pre-built binaries or only using your custom source build?

@ptrblck
This version of PyTorch works fine for multi-node training on A100s; I’m seeing the stack smashing error only on H100s.
I’m not sure the version is the issue, since it already works on A100s. The reason for using this PyTorch version is that we want to use FP8 for training, and I don’t think FP8 is supported in earlier versions.
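
Since the same build works on A100 nodes, it may help to compare the software environment of a working node against a failing one. The log already shows a few versions (NCCL 2.18.3+cuda12.1, aws-ofi-nccl 1.5.0aws, cudaDriverVersion 12000). The sketch below is one way to gather comparable values on each node and diff them. The helper names `collect_environment` and `diff_reports` and the chosen environment-variable prefixes are assumptions for illustration, and nothing here identifies which component is at fault.

```python
import importlib.util
import os
import platform


def collect_environment(prefixes=("NCCL_", "FI_", "TORCH_")) -> dict:
    """Gather version strings and selected environment variables on this node."""
    report = {"python": platform.python_version(), "torch": "not installed"}
    if importlib.util.find_spec("torch") is not None:
        try:
            import torch

            report["torch"] = torch.__version__
            report["torch_git"] = getattr(torch.version, "git_version", None)
            report["cuda"] = torch.version.cuda
            try:
                version = torch.cuda.nccl.version()
                # Recent builds return a tuple; older ones return an int.
                if isinstance(version, tuple):
                    report["nccl"] = ".".join(map(str, version))
                else:
                    report["nccl"] = str(version)
            except Exception as exc:  # e.g. CPU-only builds without NCCL
                report["nccl"] = f"unavailable ({type(exc).__name__})"
        except ImportError as exc:
            report["torch"] = f"import failed ({exc})"
    for name, value in sorted(os.environ.items()):
        if name.startswith(tuple(prefixes)):
            report[f"env:{name}"] = value
    return report


def diff_reports(a: dict, b: dict) -> dict:
    """Return only the keys whose values differ between two reports."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Running this on an A100 node and an H100 node and passing the two dictionaries to `diff_reports` narrows the comparison to whatever actually differs between them.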