I'm using PyTorch installed via conda, running my code with python -m torch.distributed.launch --use_env --nproc_per_node 2 on a single node with 2 GPUs.
In my code I set up logging so that messages are only logged by the main process, and this worked in previous PyTorch versions. In PyTorch 1.8.0, however, messages are also emitted by an additional default-style logger, both in the main process and in the other process(es).
This is how it looks in PyTorch 1.7.0:
[2021-03-08 21:03:51,597][midaGAN.utils.environment][INFO] - PyTorch version: 1.7.0
[2021-03-08 21:03:51,598][midaGAN.utils.environment][INFO] - CUDA 10.2 - cuDNN 7605
[2021-03-08 21:03:51,598][midaGAN.utils.environment][INFO] - Global rank: 0.
[2021-03-08 21:03:51,598][midaGAN.utils.environment][INFO] - Local rank: 0.
And this is with PyTorch 1.8.0. The same messages from the main process are logged twice: once with the logger I defined, and once with a default-styled logger that does not appear in 1.7.0.
That same default logger also logs messages from the other (non-main) process.
INFO:midaGAN.utils.environment:PyTorch version: 1.8.0
[2021-03-08 21:05:36,843][midaGAN.utils.environment][INFO] - PyTorch version: 1.8.0
INFO:midaGAN.utils.environment:CUDA 10.2 - cuDNN 7605
[2021-03-08 21:05:36,843][midaGAN.utils.environment][INFO] - CUDA 10.2 - cuDNN 7605
INFO:midaGAN.utils.environment:Global rank: 0.
[2021-03-08 21:05:36,844][midaGAN.utils.environment][INFO] - Global rank: 0.
INFO:midaGAN.utils.environment:Local rank: 0.
[2021-03-08 21:05:36,844][midaGAN.utils.environment][INFO] - Local rank: 0.
...
# NOTE: my logger is not enabled for the other processes, yet their messages still appear via this default-style logger
INFO:midaGAN.utils.environment:PyTorch version: 1.8.0
INFO:midaGAN.utils.environment:CUDA 10.2 - cuDNN 7605
INFO:midaGAN.utils.environment:Global rank: 1.
INFO:midaGAN.utils.environment:Local rank: 1.
Actually, even in 1.7.0, after the first iteration I see the message INFO:root:Reducer buckets have been rebuilt in this iteration. and from then on this same default logger starts logging, just like in 1.8.0.
INFO:root:Reducer buckets have been rebuilt in this iteration.
INFO:root:Added key: store_based_barrier_key:2 to store for rank: 1
INFO:root:Added key: store_based_barrier_key:2 to store for rank: 0
I met the same problem with distributed training when I updated torch to 1.8.0, and I found that the handlers list of the RootLogger, which is the parent of all other loggers, is not empty. Checking the PyTorch code, I found that it builds a logger with logging.getLogger(), which returns the RootLogger. Thus, when we call logger.info, it invokes the handlers of our own logger and then, through propagation, the handlers of the RootLogger as well.
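The duplication can be reproduced without PyTorch at all. This is a minimal sketch (the ListHandler class and the logger name are just illustrative) showing that once the root logger has a handler, the default propagate=True means every record is handled twice:

```python
import logging

class ListHandler(logging.Handler):
    """Collects formatted records so we can inspect who handled what."""
    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(self.format(record))

# Something else in the process (e.g. a library) has attached a handler
# to the root logger -- this is the situation described above.
root_handler = ListHandler()
logging.getLogger().addHandler(root_handler)
logging.getLogger().setLevel(logging.INFO)

# Our own named logger with its own handler.
own_handler = ListHandler()
logger = logging.getLogger("midaGAN.utils.environment")
logger.addHandler(own_handler)

# propagate is True by default, so the record reaches BOTH our handler
# and the root logger's handler -> two copies of each message.
logger.info("PyTorch version: 1.8.0")
print(len(own_handler.records), len(root_handler.records))  # -> 1 1
```

Each handler sees the record once, so on a console where both handlers write to the same stream you get the doubled output shown in the logs above.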
I solved this problem by emptying the handlers of the RootLogger before building my own logger.
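A minimal sketch of that fix (the function name, logger name, and format string are just illustrative, not from the original code):

```python
import logging

def setup_logger(name="midaGAN"):
    """Configure our own logger after clearing any handlers that were
    attached to the root logger (e.g. by another library)."""
    root = logging.getLogger()
    # Iterate over a copy, since removeHandler mutates the list.
    for handler in root.handlers[:]:
        root.removeHandler(handler)

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("[%(asctime)s][%(name)s][%(levelname)s] - %(message)s")
    )
    logger.addHandler(handler)
    return logger

logger = setup_logger()
logger.info("PyTorch version: 1.8.0")  # now logged exactly once
```

An alternative that avoids touching the root logger is to set logger.propagate = False on your own logger, so records are never forwarded to the root logger's handlers in the first place.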