Segmentation fault always occurs during backpropagation

I compiled PyTorch from source on an aarch64 machine. When I use PyTorch for distributed training, I always get a segmentation fault, like this:

Traceback (most recent call last):
  File "/GPUFS/nsccgz_xliao_lds/local/python3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/GPUFS/nsccgz_xliao_lds/local/python3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/GPUFS/nsccgz_xliao_lds/local/torch/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/GPUFS/nsccgz_xliao_lds/local/torch/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/GPUFS/nsccgz_xliao_lds/local/python3.7/bin/python3', '-u', '/GPUFS/nsccgz_xliao_lds/deepnet_mpi/CosmoFlow.py', '--local_rank=0', '--epochs=120', '--backend=gloo', '--workers=0', '--batch-size=1', '--print-freq=50', '--data=/GPUFS/nsccgz_xliao_lds/Nbody/datasets/v6']' died with <Signals.SIGSEGV: 11>.

I ran a simple test and found that the error occurs during backpropagation:

loss.sum().backward()
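For context, the failing pattern can be sketched as a minimal standalone script (hypothetical model and shapes, no DDP involved; on the reporter's machine the segfault happens inside the backward call, which dispatches into OpenBLAS's threaded GEMM kernels):

```python
import torch

# Hypothetical stand-in for the real model; any Linear layer exercises
# BLAS GEMM on CPU in both the forward and backward pass.
model = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)

loss = model(x)          # forward pass
loss.sum().backward()    # backward pass -- the reported crash site

print(model.weight.grad.shape)  # → torch.Size([64, 64])
```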

Output of `python -m torch.utils.collect_env`:

Collecting environment information...
PyTorch version: 1.5.0a0+4ff3872
Is debug build: No
CUDA used to build PyTorch: None

OS: CentOS Linux release 7.6.1810 (AltArch)
GCC version: (GCC) 9.2.0
CMake version: version 2.8.12.2

Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.5.0a0+4ff3872
[conda] Could not collect

uname -a

Linux aln220 4.14.0-115.el7a.0.1.aarch64 #1 SMP Sun Nov 25 20:54:21 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux

I use OpenBLAS as the BLAS backend. I also found that if I set the environment variable OMP_NUM_THREADS to 1, the error does not occur.
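That observation points at OpenBLAS's multi-threaded kernels. A minimal sketch of the workaround, assuming OpenBLAS reads OMP_NUM_THREADS at library load time, is to set the variable in the environment (or in Python) before torch is imported:

```python
import os

# Workaround observed in this thread: with OMP_NUM_THREADS=1 the crash
# disappears. Set it before importing torch so OpenBLAS picks it up
# when the shared library is loaded.
os.environ["OMP_NUM_THREADS"] = "1"

# import torch  # must come after the environment variable is set

print(os.environ["OMP_NUM_THREADS"])  # → 1
```

Equivalently, `OMP_NUM_THREADS=1 python3 train.py` on the command line (here `train.py` is a hypothetical script name).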

Hi,

Can you give a Python stack trace of where it happens? In particular, how do you use distributed training: via DDP or RPC? Do you use the distributed optimizer?

Also, can you use gdb to see where the segfault comes from?

Sorry for the late reply. I recompiled OpenBLAS and now the problem is solved. Thanks for your reply!
