Hi.
I’m using PyTorch 1.8.1 with Horovod.
I want to profile my code, but when it reaches the batch to be profiled, it crashes with the following error.
(omitted…)
[1,0]:[node07:11736] *** Process received signal ***
[1,0]:[node07:11736] Signal: Segmentation fault (11)
[1,0]:[node07:11736] Signal code: (-6)
[1,0]:[node07:11736] Failing at address: 0x44e00002dd8
[1,0]:[node07:11736] [ 0] [1,0]:/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fe67d5b6390]
[1,0]:[node07:11736] [ 1] [1,0]:/home/name/.conda/envs/horovod2/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so(+0xe0693fe)[0x7fe615e523fe]
[1,0]:[node07:11736] [ 2] [1,0]:/home/name/.conda/envs/horovod2/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so(+0xe06c808)[0x7fe615e55808]
[1,0]:[node07:11736] [ 3] [1,0]:/home/name/.conda/envs/horovod2/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so(+0xe214858)[0x7fe615ffd858]
[1,0]:[node07:11736] [ 4] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fe67d5ac6ba]
[1,0]:[node07:11736] [ 5] [1,0]:/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fe67d2e251d]
[1,0]:[node07:11736] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node node07 exited on signal 11 (Segmentation fault).
Moreover, the .json file is not created even when the above error does not appear.
Please let me know how to solve this problem.
Thank you.