aot_eager mode segfault during BERT training

Hi community, I am using this code to try out BERT training.
Initially I used the default inductor backend; it emits a segmentation fault after compiling around 260 kernels.


magma-cuda117 2.6.1 1 pytorch
pytorch-cuda 11.7 h778d358_3 pytorch-nightly
torch 2.1.0a0+gitbbfd5e5 pypi_0 pypi
triton 2.0.0 dev_0

Then I switched to aot_eager; it still segfaulted, this time after 618 kernels.
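For reference, the backend switch in my script looks roughly like this (the model below is a toy stand-in, not the actual BERT training code):

```python
import torch
import torch.nn as nn

# Toy placeholder model; the real script trains a full BERT
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))

# Same call pattern, just switching the backend string:
# compiled = torch.compile(model, backend="inductor")  # default; segfaults ~260 kernels in
compiled = torch.compile(model, backend="aot_eager")   # still segfaults, ~618 kernels in

x = torch.randn(4, 32)
out = compiled(x)
print(out.shape)  # torch.Size([4, 32])
```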

With lldb -- python, I got the following backtrace (I believe under the debugger the script fails at a much earlier stage than before):

* thread #53, name = 'python', stop reason = signal SIGBUS: illegal address
  * frame #0: 0x00007ffff71acded`__memset_avx2_erms at memset-vec-unaligned-erms.S:145
    frame #1: 0x00007fff606c0200`::ncclShmSetup(shmPath="/dev/shm/nccl-Ez4TXV", shmSize=4096, fd=0x00007fff2a33dba0, ptr=0x00007fff2a33dbb0, create=1) at
    frame #2: 0x00007fff606c0264`ncclShmOpen(shmPath="/dev/shm/nccl-Ez4TXV", shmSize=4096, shmPtr=0x00007ffd3c04df00, devShmPtr=0x00007ffd3c04df08, create=1) at
    frame #3: 0x00007fff6074045b`::shmSendSetup(comm=0x000000009bbbd560, graph=0x00007fff2a3411d0, myInfo=0x00007ffd3c00c0d0, peerInfo=0x00007ffd3c00c090, connectInfo=0x00007fff2a33ef20, send=0x00007ffd3c0385b8, channelId=1, connIndex=0) at
    frame #4: 0x00007fff606d44bc`::selectTransport<1>(comm=0x000000009bbbd560, graph=0x00007fff2a3411d0, connect=0x00007fff2a33ef20, channelId=1, peer=0, connIndex=0, transportType=0x00007fff2a33ed04) at
    frame #5: 0x00007fff606d2555`ncclTransportP2pSetup(comm=0x000000009bbbd560, graph=0x00007fff2a3411d0, connIndex=0, highestTransportType=0x0000000000000000) at
    frame #6: 0x00007fff6068ea19`::initTransportsRank(comm=0x000000009bbbd560, commId=0x00007fff2a359e20) at
    frame #7: 0x00007fff60691324`::ncclCommInitRankFunc(job_=0x00000000ad100910) at
    frame #8: 0x00007fff6068448d`ncclAsyncJobMain(arg=0x00000000ad100910) at
    frame #9: 0x00007ffff7bbb6db`start_thread(arg=0x00007fff2a35a700) at pthread_create.c:463
    frame #10: 0x00007ffff713f61f`__clone at clone.S:95

At first sight this looks related to setting up shared memory for NCCL and doesn't even seem related to PyTorch itself. However, during the run without lldb, the segfault seems to happen at a much later stage.
What is the suggested next step for debugging this?
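In case it helps: a SIGBUS inside __memset_avx2_erms during ncclShmSetup is often a sign that /dev/shm is too small or full (e.g. the 64 MB default in containers). A quick sanity check, assuming a Linux host (the script name below is a placeholder):

```shell
# Check how much shared memory is available; NCCL's shm transport allocates here
df -h /dev/shm

# Then re-run with verbose NCCL logging to see where setup fails, e.g.:
#   NCCL_DEBUG=INFO python train.py       # train.py is a placeholder name
# or, as a workaround, disable the shared-memory transport entirely:
#   NCCL_SHM_DISABLE=1 python train.py
```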

I am facing the same problem when I run inference via the libtorch API. Have you resolved this problem?