Distributed 1.8.0 logging twice in a single process, same code works properly in 1.7.0

Using PyTorch installed with conda. I am running my code with python -m torch.distributed.launch --use_env --nproc_per_node 2 on a single node with 2 GPUs.

In my code I set up logging so that messages are only emitted by the main process, and this worked with previous PyTorch versions. In PyTorch 1.8.0, however, every message is additionally logged by a default-style logger, both for the main process and for the other(s).
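
For context, my setup boils down to something like the sketch below (simplified, not the actual code; the real setup follows DIRECT's environment.py linked further down, and setup_logging is just an illustrative name): a handler with my custom format is only attached on the main process.

import logging
import os

def setup_logging():
    # Illustrative sketch only: attach a handler with my custom format on the
    # main process (local rank 0), so the other ranks should stay silent.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by launch --use_env
    logger = logging.getLogger("midaGAN")
    logger.setLevel(logging.INFO)
    if local_rank == 0:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("[%(asctime)s][%(name)s][%(levelname)s] - %(message)s")
        )
        logger.addHandler(handler)
    return logger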

This is how it looks in PyTorch 1.7.0:

[2021-03-08 21:03:51,597][midaGAN.utils.environment][INFO] - PyTorch version: 1.7.0
[2021-03-08 21:03:51,598][midaGAN.utils.environment][INFO] - CUDA 10.2 - cuDNN 7605
[2021-03-08 21:03:51,598][midaGAN.utils.environment][INFO] - Global rank: 0.
[2021-03-08 21:03:51,598][midaGAN.utils.environment][INFO] - Local rank: 0.

And this is with PyTorch 1.8.0. The same messages from the main process are logged twice: once with the logger I defined and once with a default-style logger that does not appear in 1.7.0.
That same default logger also logs the messages of the other (non-main) process.

INFO:midaGAN.utils.environment:PyTorch version: 1.8.0
[2021-03-08 21:05:36,843][midaGAN.utils.environment][INFO] - PyTorch version: 1.8.0
INFO:midaGAN.utils.environment:CUDA 10.2 - cuDNN 7605
[2021-03-08 21:05:36,843][midaGAN.utils.environment][INFO] - CUDA 10.2 - cuDNN 7605
INFO:midaGAN.utils.environment:Global rank: 0.
[2021-03-08 21:05:36,844][midaGAN.utils.environment][INFO] - Global rank: 0.
INFO:midaGAN.utils.environment:Local rank: 0.
[2021-03-08 21:05:36,844][midaGAN.utils.environment][INFO] - Local rank: 0.
.
.
.
# NOTE: logging is not enabled for the other (non-main) processes, yet their messages still appear through a default-style logger
INFO:midaGAN.utils.environment:PyTorch version: 1.8.0
INFO:midaGAN.utils.environment:CUDA 10.2 - cuDNN 7605
INFO:midaGAN.utils.environment:Global rank: 1.
INFO:midaGAN.utils.environment:Local rank: 1.

Am I missing something?

Actually, in 1.7.0, after the first iteration I see the message INFO:root:Reducer buckets have been rebuilt in this iteration. and from then on this unwanted logger starts logging just like in 1.8.0.

I am not sure. What setup do you use for your logger? Do you use PyTorch Lightning?

I don’t use PyTorch Lightning. I based my logger setup on https://github.com/directgroup/direct/blob/d1eb9ba8706b25a7aae19b03587cd79cd3675822/direct/environment.py#L78, which calls https://github.com/directgroup/direct/blob/d1eb9ba8706b25a7aae19b03587cd79cd3675822/direct/utils/logging.py#L10 to set up the logger.

Is it possible that torch.distributed is defining a root logger itself?

The reason I think that may be the case is that messages from my code coming through this unwanted logger look like this:

INFO:midaGAN.utils.environment:PyTorch version: 1.8.0
INFO:Validator:Validation started

while the messages from DDP look like this:

INFO:root:Reducer buckets have been rebuilt in this iteration.
INFO:root:Added key: store_based_barrier_key:2 to store for rank: 1
INFO:root:Added key: store_based_barrier_key:2 to store for rank: 0
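
To make that suspicion concrete: in Python's logging module, a record emitted on a named logger also propagates to every handler attached to the root logger. So if anything inside torch.distributed attaches a handler to the root (for instance via logging.basicConfig, which is also what a bare logging.info call triggers), my messages get printed a second time in the default INFO:name:message style. A minimal, PyTorch-independent sketch of that effect:

import logging

# My named logger with a custom format, as in my setup.
my_logger = logging.getLogger("midaGAN.utils.environment")
my_logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("[%(asctime)s][%(name)s][%(levelname)s] - %(message)s")
)
my_logger.addHandler(handler)

my_logger.info("PyTorch version: 1.8.0")  # printed once, with my format

# Simulate a library configuring the root logger behind my back.
logging.basicConfig(level=logging.INFO)

my_logger.info("PyTorch version: 1.8.0")
# Now printed twice: once with my format and once as
# INFO:midaGAN.utils.environment:PyTorch version: 1.8.0
# because the record propagates up to the root logger's handler.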

Could it be related to [DDP Logging] Log comm. hook in ddp logging by rohan-varma · Pull Request #52966 · pytorch/pytorch · GitHub?

Also, I checked with one of the people who developed DIRECT whether they have the same issue, and they do. They didn’t have it before either.

I met the same problem with distributed training when I updated torch to 1.8.0, and I found that the handlers list of the RootLogger, which is the parent of all other loggers, is not empty. I checked the PyTorch code and found that it builds a logger with logging.getLogger(), which returns the RootLogger. Thus, when we call logger.info, it calls the handlers of our logger and then the handlers of the RootLogger.
I solved this problem by emptying the handlers of the RootLogger before building a new one.
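
For anyone else hitting this, a minimal sketch of that workaround as I understand it (clear_root_handlers is just an illustrative name); call it before configuring your own logger:

import logging

def clear_root_handlers():
    # Remove whatever handlers were attached to the root logger (by PyTorch
    # or anything else), so records from named loggers are no longer
    # duplicated through it.
    root = logging.getLogger()
    for handler in list(root.handlers):
        root.removeHandler(handler)

clear_root_handlers()
# ...then build your own logger/handlers as before.

Setting propagate = False on your own named logger should have a similar effect for your own messages, since records then never reach the root handlers at all.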

Could you be more specific about how to "empty the handlers in RootLogger" before building a new one?