aot_eager mode segfault during BERT training

Hi Community, I am using this code to try out BERT training.
Initially I used the default inductor backend, and it hit a segmentation fault after around 260 kernels.

My environment:

magma-cuda117 2.6.1 1 pytorch
pytorch-cuda 11.7 h778d358_3 pytorch-nightly
torch 2.1.0a0+gitbbfd5e5 pypi_0 pypi
triton 2.0.0 dev_0

Then I switched to aot_eager, and it still segfaulted, this time after 618 kernels.

Running under lldb (lldb -- python run.py), I got the following backtrace. I believe that under the debugger the script fails at a much earlier stage than before:

* thread #53, name = 'python', stop reason = signal SIGBUS: illegal address
  * frame #0: 0x00007ffff71acded libc.so.6`__memset_avx2_erms at memset-vec-unaligned-erms.S:145
    frame #1: 0x00007fff606c0200 libtorch_cuda.so`::ncclShmSetup(shmPath="/dev/shm/nccl-Ez4TXV", shmSize=4096, fd=0x00007fff2a33dba0, ptr=0x00007fff2a33dbb0, create=1) at shmutils.cc:52:21
    frame #2: 0x00007fff606c0264 libtorch_cuda.so`ncclShmOpen(shmPath="/dev/shm/nccl-Ez4TXV", shmSize=4096, shmPtr=0x00007ffd3c04df00, devShmPtr=0x00007ffd3c04df08, create=1) at shmutils.cc:61:3
    frame #3: 0x00007fff6074045b libtorch_cuda.so`::shmSendSetup(comm=0x000000009bbbd560, graph=0x00007fff2a3411d0, myInfo=0x00007ffd3c00c0d0, peerInfo=0x00007ffd3c00c090, connectInfo=0x00007fff2a33ef20, send=0x00007ffd3c0385b8, channelId=1, connIndex=0) at shm.cc:87:3
    frame #4: 0x00007fff606d44bc libtorch_cuda.so`::selectTransport<1>(comm=0x000000009bbbd560, graph=0x00007fff2a3411d0, connect=0x00007fff2a33ef20, channelId=1, peer=0, connIndex=0, transportType=0x00007fff2a33ed04) at transport.cc:33:7
    frame #5: 0x00007fff606d2555 libtorch_cuda.so`ncclTransportP2pSetup(comm=0x000000009bbbd560, graph=0x00007fff2a3411d0, connIndex=0, highestTransportType=0x0000000000000000) at transport.cc:98:9
    frame #6: 0x00007fff6068ea19 libtorch_cuda.so`::initTransportsRank(comm=0x000000009bbbd560, commId=0x00007fff2a359e20) at init.cc:790:3
    frame #7: 0x00007fff60691324 libtorch_cuda.so`::ncclCommInitRankFunc(job_=0x00000000ad100910) at init.cc:1089:3
    frame #8: 0x00007fff6068448d libtorch_cuda.so`ncclAsyncJobMain(arg=0x00000000ad100910) at group.cc:62:26
    frame #9: 0x00007ffff7bbb6db libpthread.so.0`start_thread(arg=0x00007fff2a35a700) at pthread_create.c:463
    frame #10: 0x00007ffff713f61f libc.so.6`__clone at clone.S:95

At first sight this looks related to setting up shared memory for NCCL and does not even seem related to PyTorch. However, I think that in the run without lldb the segfault happens at a much later stage.
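Since the trace stops in memset inside ncclShmSetup on a file under /dev/shm, one thing worth ruling out is that /dev/shm is too small or already full. That is only an assumption on my side, the trace does not prove it. A minimal sketch to check it, using only the standard library (the helper name shm_usage is made up for this post):

```python
import os
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in bytes for `path`, or None if it does not exist."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.total, usage.used, usage.free


if __name__ == "__main__":
    result = shm_usage()
    if result is None:
        print("/dev/shm not found on this system")
    else:
        total, used, free = result
        mib = 1024 * 1024
        print(f"/dev/shm: total={total // mib} MiB, "
              f"used={used // mib} MiB, free={free // mib} MiB")
```

If the free space turns out to be very small and the job runs inside a container, enlarging the shared memory mount (for Docker, the --shm-size or --ipc=host options) would be the first thing to try.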
What is the suggested next step for debugging this?
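To see where the plain run (without lldb) actually crashes, one option is Python's built-in faulthandler module, which dumps the Python tracebacks of all threads when the process receives a fatal signal such as SIGSEGV or SIGBUS. A minimal sketch, where the helper name enable_crash_traces and the log file name are placeholders I picked:

```python
import faulthandler


def enable_crash_traces(log_path="crash_traces.log"):
    """Write Python tracebacks of all threads to log_path on a fatal signal
    (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL)."""
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    # The file object must stay open for as long as faulthandler is enabled.
    return log_file


if __name__ == "__main__":
    crash_log = enable_crash_traces()
    # ... the rest of run.py (model setup, compile, training loop) would follow here
```

This only gives the Python-side stack at the moment of the crash, not the native frames, so it complements the lldb backtrace rather than replacing it.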