Errors in GLOO backend in init_process_group after system updates

My PyTorch distributed code (similar to the ImageNet example, but without multiprocessing) was working across distributed nodes using the GLOO backend (Python 3.7, PyTorch 1.0.1). After some system updates I now have only intermittent success: init_process_group frequently throws various errors. I can usually get it to work on 2 nodes (4 GPUs per node), but with more than 2 compute nodes an error is almost always thrown in the init_process_group call (see below for some of the errors).
I ran with python -X faulthandler, and for the segmentation faults it points to ProcessGroupGloo in distributed_c10d.py (line 360).
I’m a bit at a loss as to how to debug this and what to check. I reinstalled the PyTorch packages after the system updates and also tried going back to Python 3.6, but no luck. I haven’t tried compiling from source yet. I started looking at the GLOO repo and saw there are some tests, but I’m not sure whether they would help pinpoint the cause.

Error #1:

srun: error: tiger-i21g6: task 0: Segmentation fault

Error #2:

Traceback (most recent call last):
  File "disruptcnn/main.py", line 595, in <module>
    main()
  File "disruptcnn/main.py", line 161, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "disruptcnn/main.py", line 177, in main_worker
    world_size=args.world_size, rank=args.rank)
  File "~/.conda/envs/python3a/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 360, in init_process_group
    timeout=timeout)
RuntimeError: read: Bad address
terminate called after throwing an instance of 'std::system_error'
  what(): read: Bad address

Error #3:

*** Error in `~/.conda/envs/python36/bin/python': free(): invalid next size (fast): 0x000055b8f75ea9b0 ***


Hi!

Can you share how you’re calling init_process_group, which initialization method you’re using, etc? If you get the first error on one machine and the second error on another, it is possible that the second error is caused by the first process crashing.

I’m asking because the read: Bad address error makes me think there is something going on with the TCP store. If you’re using the TCP initialization method, and try to use an IP that is no longer valid, for example, this is the type of error that could happen.
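
For reference, a TCP init call looks roughly like the sketch below; the hostname, port, world_size, and rank here are placeholders, not values from your setup:

    import torch.distributed as dist

    # TCP initialization: every process connects to the address of rank 0.
    # 'node0.example.com' and port 29500 are placeholders; a stale or wrong
    # address here is the kind of thing that can produce "read: Bad address".
    # world_size is the total process count; rank differs per process.
    dist.init_process_group(backend='gloo',
                            init_method='tcp://node0.example.com:29500',
                            world_size=4, rank=0)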

I’m using the file method, on a parallel file system:

    import os
    import torch.distributed as dist

    # SLURM provides the job ID, total task count, and per-process rank.
    jobid = os.environ['SLURM_JOB_ID']
    world_size = int(os.environ['SLURM_NTASKS'])
    rank = int(os.environ['SLURM_PROCID'])
    dist.init_process_group(backend='gloo', init_method='file:///scratch/gpfs/me/main_'+jobid+'.txt',
                            world_size=world_size, rank=rank)

The errors don’t occur one on each machine; rather, if I run this multiple times, one of those errors is thrown (but never both at the same time).

When you run this multiple times, do those runs use the same SLURM_JOB_ID? PyTorch makes an attempt to remove the file used for initialization at exit, but if any of the processes crashes, it may stick around and cause problems. You can “fix” this by force-removing the file before starting a new run.
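
Just as an illustration (the path mirrors the file:// URL in your snippet), clearing a leftover file before any process calls init_process_group would look something like this:

    import os

    # Illustrative sketch: remove a stale init file left over from a crashed
    # run. Do this once (e.g. in the job script) before launching the
    # processes, not from every rank.
    jobid = os.environ['SLURM_JOB_ID']
    init_file = '/scratch/gpfs/me/main_' + jobid + '.txt'
    if os.path.exists(init_file):
        os.remove(init_file)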

No, the SLURM system gives you a unique SLURM_JOB_ID for each run that you do (which is why I’m using it, to ensure the file is unique for each run).

I noticed the note on fcntl. Is there some test I should run on the parallel file system to make sure locking behaves correctly, e.g. a quick fcntl check like the sketch below? I think GPFS should be fine, but perhaps there’s some edge case, and a newer driver or something broke it.
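
Something like this is what I had in mind (the path is just an example on our GPFS mount):

    import fcntl

    # Rough single-process sanity check, not a full test: take and release
    # an exclusive fcntl lock on a file on the GPFS mount. Running this
    # concurrently from several nodes would exercise contention.
    with open('/scratch/gpfs/me/lock_test.txt', 'w') as f:
        fcntl.lockf(f, fcntl.LOCK_EX)   # blocks until the lock is granted
        f.write('locked\n')
        f.flush()
        fcntl.lockf(f, fcntl.LOCK_UN)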

I was able to try out NCCL, and it appears to work for the >2-node runs, so this isn’t as urgent, but I’d still be interested in figuring it out.
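
For completeness, the NCCL run is essentially the same snippet as before with only the backend argument changed (a sketch, not verbatim from my script):

    import os
    import torch.distributed as dist

    # Same file-based initialization as before; only the backend differs.
    jobid = os.environ['SLURM_JOB_ID']
    dist.init_process_group(backend='nccl',
                            init_method='file:///scratch/gpfs/me/main_' + jobid + '.txt',
                            world_size=int(os.environ['SLURM_NTASKS']),
                            rank=int(os.environ['SLURM_PROCID']))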

Thanks, that rules out clobbering the same file from multiple runs.

There is. We have a test for the file store that’s built by default if you compile from source and will be located at build/bin/FileStoreTest. This test automatically creates some files in TMPDIR, which you can override yourself to force it to use the GPFS path. This doesn’t fully simulate the scenario you have with multiple machines, but at least hammers the file system with multiple processes from the same machine. It could uncover something, so it’s definitely worth a try.
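
Something along these lines should point it at GPFS; the paths are illustrative and assume you run from the root of a from-source build:

    import os
    import subprocess

    # Illustrative: point TMPDIR at the GPFS mount so FileStoreTest creates
    # its files there. Assumes the current directory is a built PyTorch
    # source checkout, so build/bin/FileStoreTest exists.
    env = dict(os.environ, TMPDIR='/scratch/gpfs/me/filestore_test')
    os.makedirs(env['TMPDIR'], exist_ok=True)
    subprocess.run(['./build/bin/FileStoreTest'], env=env, check=True)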

The use of this store when using the NCCL backend is very light. Only a single process writes to the file and all others read from it. When using the Gloo backend, everybody both writes to and reads from the file, causing a lot more contention.


Hi there,

I’m using a GPFS filesystem for file init, and I also hit the read(): Bad address problem. When I change the file location to local /tmp, it works fine.

FYI: Gluster’s mandatory file locks have some known issues: https://docs.gluster.org/en/v3/Administrator%20Guide/Mandatory%20Locks/