torch.multiprocessing.spawn with join=False raises FileNotFoundError

Hello all,
I am trying to implement Distributed Parallel for my model and I followed the ImageNet example for this. I am pretty new to distributed programming, so I am not sure about a bunch of things. When I use torch.multiprocess.spawn with join=True, there is no output that is printed. When I change the join to False, I get the following error below.

<torch.multiprocessing.spawn.SpawnContext object at 0x2b49eee8a3c8>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.5/multiprocessing/spawn.py", line 106, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.5/multiprocessing/spawn.py", line 116, in _main
    self = pickle.load(from_parent)
  File "/usr/lib/python3.5/multiprocessing/synchronize.py", line 111, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.5/multiprocessing/spawn.py", line 106, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.5/multiprocessing/spawn.py", line 116, in _main
    self = pickle.load(from_parent)
  File "/usr/lib/python3.5/multiprocessing/synchronize.py", line 111, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory

I am submitting this job through a SLURM script, which I have included below.

#!/bin/sh
#SBATCH --ntasks=4
#SBATCH --time=60:00:00
#SBATCH --partition=gpu
#SBATCH --mem=64gb
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --constraint=gpu_32gb
#SBATCH --job-name=test
#SBATCH --output=.../out_files/test.out

export PYTHONPATH=$WORK/tf-gpu-pkgs
module load singularity
singularity exec docker://<user>/pytorch-opencv:latest python3 -u $@ --use_adam=1 --multiprocessing_distributed --benchmarks=0 --benchmark_arch='vgg19' --batch_size=128 --test=0 --transfer=0 --dataset='<dataset-here>'

My code follows the ImageNet example closely, and I am not sure what I am doing wrong.

Thank you,

Are you capturing the SpawnContext object returned by the call to torch.multiprocessing.spawn? The SpawnContext is returned only when join=False, and it must be kept alive for the spawned processes to coordinate IPC. If you let the object be garbage-collected, you will see exactly this error.

Here is a GitHub issue with some more information: https://github.com/pytorch/pytorch/issues/30461
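In other words, the join=False pattern looks like this (a minimal sketch; the worker body and nprocs=2 are placeholders, not your actual code):

import torch.multiprocessing as mp

def worker(rank):
    # Placeholder worker; your real main_worker would set up DDP here.
    print('worker', rank, 'started')

if __name__ == '__main__':
    # join=False returns a SpawnContext that must be kept alive; if it is
    # garbage-collected, the children hit FileNotFoundError while
    # unpickling its synchronization primitives.
    ctx = mp.spawn(worker, nprocs=2, join=False)
    # ... the parent can do other work here ...
    ctx.join()  # wait for all workers to exit
    print('all workers done')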

Hello Omkar,
Thank you for replying. The weird thing is that I don’t see the 'terminated' print statement when I use join=True. Based on the issue you linked, when I spawn the processes, shouldn’t I see the print statements from my main_worker function before I hit the 'terminated' print statement? I apologize if this question isn’t framed well; I am new to distributed training and don’t understand the system that well.

    if args.multiprocessing_distributed:
        ctx = mp.spawn(main_worker, nprocs=ngpus_per_node,
                       args=(ngpus_per_node, args), join=False)
        time.sleep(3)
        print('terminated')
        ctx.join()
    else:
        # Simply call main_worker function
        main_worker(args.gpu, ngpus_per_node, args)


def main_worker(gpu, ngpus_per_node, args):
    global best_acc1
    print(gpu)
    args.gpu = gpu
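For reference, here is the fix as a diff against a minimal repro: capture the SpawnContext returned by mp.spawn instead of discarding it, and join it explicitly before the parent exits.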
diff --git a/repro_org.py b/repro.py
index be44c3d..e971db4 100644
--- a/repro_org.py
+++ b/repro.py
@@ -6,7 +6,8 @@ def worker(nproc, arg1, arg2, arg3):
     test = True
 
 if __name__ == '__main__':
-    mp.spawn(worker, (None, None, None), nprocs=1, join=False)
+    ctx = mp.spawn(worker, (None, None, None), nprocs=1, join=False)
     time.sleep(3)
     print('terminated')
+    ctx.join()
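(As I understand the linked issue, when the SpawnContext is discarded, the parent cleans up the multiprocessing semaphores that the spawned children still need to rebuild while unpickling, which is where the FileNotFoundError above comes from. Keeping ctx alive and joining it prevents that.)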