Distributed training prints log messages twice

kirk86 · September 13, 2019, 3:56pm

How can I prevent my log messages from being printed multiple times when I use distributed training?

Any ideas on how to resolve this, or where to look?
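One common way to avoid the duplicates is to let only the rank 0 process log at INFO level. The following is a minimal sketch, not a definitive fix: it assumes the script is started with torch.distributed.launch, which exports a RANK environment variable to each process. The helper name get_logger and the log format are hypothetical, chosen for illustration.

```python
import logging
import os


def get_logger(name="train"):
    """Return a logger that emits INFO only on rank 0 (hypothetical helper)."""
    # Assumption: the launcher exported RANK; default to 0 for single-process runs.
    rank = int(os.environ.get("RANK", "0"))

    logger = logging.getLogger(name)
    # Rank 0 logs everything from INFO up; other ranks only warnings and errors.
    logger.setLevel(logging.INFO if rank == 0 else logging.WARNING)

    # Attach the handler only once, so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter(f"[rank {rank}] %(levelname)s %(message)s")
        )
        logger.addHandler(handler)

    # Keep records from also being printed by the root logger.
    logger.propagate = False
    return logger
```

Calling get_logger().info("epoch done") in every process would then print once, from rank 0, while warnings and errors from the other ranks remain visible.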

Also, I keep getting an error every time I set OMP_NUM_THREADS > 1

Any thoughts on what might have gone wrong here?

kirk86 · September 13, 2019, 10:02pm

I managed to solve the error I was getting when using OMP_NUM_THREADS > 1.
Looking at my script, I had to add init_method="env://" to the call to init_process_group:

torch.distributed.init_process_group(backend='nccl', init_method='env://')
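With init_method='env://', the process group is configured from environment variables (MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE), which the launch utility sets for each process. The sketch below checks for them before initializing, so a missing variable produces a clear message. The helper names missing_env_vars and init_distributed are hypothetical, and the nccl backend is carried over from the call above.

```python
import os

# Variables read by the env:// init method when rank and world size
# are not passed to init_process_group explicitly.
REQUIRED = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def missing_env_vars(env=None):
    """Return the required variables that are absent from env (hypothetical helper)."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED if key not in env]


def init_distributed():
    """Initialize the default process group after validating the environment."""
    missing = missing_env_vars()
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
            + ". Was the script started with torch.distributed.launch?"
        )
    # Imported here so the environment check can run without PyTorch installed.
    import torch.distributed as dist

    dist.init_process_group(backend="nccl", init_method="env://")
```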

The other thing I was missing is that when calling the launch utility you have to pass a random port via --master_port; otherwise I would get the above error:

python -m torch.distributed.launch --nproc_per_node=<number of GPUs> --master_port=<some random high port number> main.py
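Instead of guessing a port number, one option is to ask the operating system for one that is currently unused and pass it as --master_port. This is only a sketch: find_free_port is a hypothetical helper, and the port could in principle be taken by another process between this check and the launch.

```python
import socket


def find_free_port():
    """Ask the OS for an unused TCP port on localhost (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS pick any free port.
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    # Print the port so it can be substituted into the launch command.
    print(find_free_port())
```

The printed value can be used in place of the port placeholder in the command above.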