Hello all,
I am trying to implement DistributedDataParallel for my model, following the ImageNet example. I am pretty new to distributed programming, so I am unsure about a number of things. When I use torch.multiprocessing.spawn with join=True, no output is printed at all. When I change join to False, I get the error below.
<torch.multiprocessing.spawn.SpawnContext object at 0x2b49eee8a3c8>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.5/multiprocessing/spawn.py", line 106, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.5/multiprocessing/spawn.py", line 116, in _main
    self = pickle.load(from_parent)
  File "/usr/lib/python3.5/multiprocessing/synchronize.py", line 111, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.5/multiprocessing/spawn.py", line 106, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.5/multiprocessing/spawn.py", line 116, in _main
    self = pickle.load(from_parent)
  File "/usr/lib/python3.5/multiprocessing/synchronize.py", line 111, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
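For context, here is roughly how I launch the workers (a minimal sketch following the ImageNet example; my real main_worker does the actual training, and args stands in for my parsed command-line arguments):

import torch
import torch.multiprocessing as mp

def main_worker(gpu, ngpus_per_node, args):
    # each spawned process starts here with its local GPU index
    print('worker {} of {} started'.format(gpu, ngpus_per_node))

def main():
    args = None  # stand-in for my parsed command-line arguments
    ngpus_per_node = torch.cuda.device_count()
    # join=True should block here until all workers exit; with join=False,
    # spawn() returns immediately with the SpawnContext object printed above
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))

if __name__ == '__main__':
    main()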
I am submitting this job through the SLURM script below.
#!/bin/sh
#SBATCH --ntasks=4
#SBATCH --time=60:00:00
#SBATCH --partition=gpu
#SBATCH --mem=64gb
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --constraint=gpu_32gb
#SBATCH --job-name=test
#SBATCH --output=.../out_files/test.out
export PYTHONPATH=$WORK/tf-gpu-pkgs
module load singularity
singularity exec docker://<user>/pytorch-opencv:latest python3 -u $@ --use_adam=1 --multiprocessing_distributed --benchmarks=0 --benchmark_arch='vgg19' --batch_size=128 --test=0 --transfer=0 --dataset='<dataset-here>'
My code closely follows the ImageNet example, and I am not sure what I am doing wrong.
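Concretely, the process-group setup in my worker mirrors the ImageNet example; here is a trimmed sketch (args.node_rank, args.dist_url, and args.world_size are simplified stand-ins for my actual arguments):

import torch
import torch.distributed as dist
import torch.nn as nn

def main_worker(gpu, ngpus_per_node, args):
    # global rank = node index * GPUs per node + local GPU index
    rank = args.node_rank * ngpus_per_node + gpu
    dist.init_process_group(backend='nccl',
                            init_method=args.dist_url,   # e.g. 'tcp://<master-ip>:<port>'
                            world_size=args.world_size,  # total processes across all nodes
                            rank=rank)
    torch.cuda.set_device(gpu)
    model = nn.Linear(128, 10).cuda(gpu)  # stand-in for my real model
    model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
    # ... training loop as in the ImageNet example ...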
Thank you,