I compiled the source code on an aarch64 machine. When I use PyTorch for distributed training, I always get segmentation faults, like this:
Traceback (most recent call last):
  File "/GPUFS/nsccgz_xliao_lds/local/python3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/GPUFS/nsccgz_xliao_lds/local/python3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/GPUFS/nsccgz_xliao_lds/local/torch/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/GPUFS/nsccgz_xliao_lds/local/torch/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/GPUFS/nsccgz_xliao_lds/local/python3.7/bin/python3', '-u', '/GPUFS/nsccgz_xliao_lds/deepnet_mpi/CosmoFlow.py', '--local_rank=0', '--epochs=120', '--backend=gloo', '--workers=0', '--batch-size=1', '--print-freq=50', '--data=/GPUFS/nsccgz_xliao_lds/Nbody/datasets/v6']' died with <Signals.SIGSEGV: 11>.
I ran a simple test and found that the error occurs during backpropagation, at this call:
loss.sum().backward()
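For reference, a minimal sketch of the kind of test I mean: a single-process gloo group with a tiny model, where the crash would happen inside backward(). The model, shapes, and port number here are illustrative, not from CosmoFlow.py:

```python
import os
import torch
import torch.distributed as dist

# Illustrative single-process setup; MASTER_ADDR/MASTER_PORT values are arbitrary.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Tiny stand-in model wrapped in DDP, as in a real distributed run.
model = torch.nn.parallel.DistributedDataParallel(torch.nn.Linear(8, 4))

x = torch.randn(2, 8)
loss = model(x)
loss.sum().backward()  # the segfault occurs during this backward pass

dist.destroy_process_group()
```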
Output of python -m torch.utils.collect_env:
Collecting environment information...
PyTorch version: 1.5.0a0+4ff3872
Is debug build: No
CUDA used to build PyTorch: None
OS: CentOS Linux release 7.6.1810 (AltArch)
GCC version: (GCC) 9.2.0
CMake version: version 2.8.12.2
Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.5.0a0+4ff3872
[conda] Could not collect
uname -a
Linux aln220 4.14.0-115.el7a.0.1.aarch64 #1 SMP Sun Nov 25 20:54:21 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux
I use OpenBLAS as the BLAS backend. I also found that if I set the environment variable OMP_NUM_THREADS to 1, the error does not occur.
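As a workaround, exporting the variable before launching avoids the crash. A sketch of how I invoke it (the launch command is commented out here because it depends on my local paths and script):

```shell
# Force single-threaded OpenBLAS/OpenMP to avoid the segfault during backward().
export OMP_NUM_THREADS=1

# Then launch as before, e.g. (paths and arguments as in the traceback above):
# python3 -m torch.distributed.launch --nproc_per_node=1 CosmoFlow.py --backend=gloo ...

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```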