So, I ran the code again.
- The observed hanging has nothing to do with torch.distributed.barrier; based on the error log, it seems related to the NCCL usage. Since you were able to run the code above with 1.9.0 using run, it may have something to do with my system.
- The code above/below works fine when using torch.distributed.launch but does not work with torch.distributed.run. I am not sure whether this has something to do with my installation. Below I provide the code and the requirements of my environment. I use 2 Tesla P100 GPUs for the test. I removed the logging arguments so that I can copy all the logs at once.
- The *.json log files are never created, with or without an explicit request to log to a file, using either launch or run. It is odd, because the log often says something like: [INFO] 2021-09-09 17:06:24,434 __init__: Setting worker0 reply file to: /tmp/torchelastic_ji82kwgj/none_v3x8ezxd/attempt_0/0/error.json. I do not know what these log files are for; probably they were never needed, so they were never created. See the small check right after this list.
- This is the NCCL error with run:
Traceback (most recent call last):
File "multig.py", line 60, in <module>
spmd_main(args.local_world_size, args.local_rank)
File "multig.py", line 28, in spmd_main
demo_basic(local_world_size, local_rank)
File "multig.py", line 43, in demo_basic
Traceback (most recent call last):
File "multig.py", line 60, in <module>
ddp_model = DDP(model, device_ids)
File
"xxx/lib/python3.7/site-packages/torch/nn/parallel/distributed.py",
line 496, in __init__
dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in:
/opt/conda/conda-bld/pytorch_1623448265233/work/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid
usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops,
too many collectives at once, mixing streams in a group, etc).
spmd_main(args.local_world_size, args.local_rank)
File "multig.py", line 28, in spmd_main
demo_basic(local_world_size, local_rank)
File "multig.py", line 43, in demo_basic
- Please check run.sh to confirm that I used torch.distributed.run correctly in terms of arguments; for the run test I passed the same --nnodes, --node_rank and --nproc_per_node arguments, as shown in the run output below.
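Regarding the missing *.json reply files mentioned above, here is a minimal sketch of how to look for them; the glob pattern simply mirrors the path the agent prints (e.g. /tmp/torchelastic_ji82kwgj/none_v3x8ezxd/attempt_0/0/error.json) and is my assumption:

import glob

# Look for torchelastic worker reply files under /tmp.
reply_files = glob.glob("/tmp/torchelastic_*/*/attempt_*/*/error.json")
print(reply_files)  # empty list on my machine, matching the observation above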
Code multig.py:
import argparse

import numpy as np
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def spmd_main(local_world_size, local_rank):
    np.random.seed(0)
    # Initialize the process group; the connection parameters come from the
    # environment variables set by the launcher.
    dist.init_process_group(backend="nccl")
    demo_basic(local_world_size, local_rank)
    # Tear down the process group
    dist.destroy_process_group()


def demo_basic(local_world_size, local_rank):
    # Set up devices for this process. For local_world_size = 2, num_gpus = 8,
    # rank 0 uses GPUs [0, 1, 2, 3] and rank 1 uses GPUs [4, 5, 6, 7].
    n = torch.cuda.device_count() // local_world_size
    device_ids = list(range(local_rank * n, (local_rank + 1) * n))

    model = ToyModel().cuda(device_ids[0])
    ddp_model = DDP(model, device_ids)

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_ids[0])
    loss_fn(outputs, labels).backward()
    optimizer.step()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument("--local_world_size", type=int, default=1)
    args = parser.parse_args()
    spmd_main(args.local_world_size, args.local_rank)
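Side note: the launch output below warns that --use_env is deprecated and suggests reading local_rank from os.environ['LOCAL_RANK']. For reference, a minimal sketch of what that variant of the entry point would look like; I am assuming the launcher also sets LOCAL_WORLD_SIZE, spmd_main is the function defined above, and I have not verified that this changes the behaviour:

import os

if __name__ == "__main__":
    # Read the ranks from the environment variables set by the launcher,
    # instead of parsing --local_rank from the command line.
    local_rank = int(os.environ["LOCAL_RANK"])
    local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])
    spmd_main(local_world_size, local_rank)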
Bash script run.sh:
#!/usr/bin/env bash
# activate conda env here.
# ==============================================================================
cudaid=$1
export CUDA_VISIBLE_DEVICES=$cudaid
python -m torch.distributed.launch \
--nnodes=1 \
--node_rank=0 \
--nproc_per_node=2 \
multi_g2.py \
--local_world_size=2
Output with torch.distributed.launch when using ./run.sh 0,1; the job finished properly:
xxx/lib/python3.7/site-packages/torch/distributed/launch.py:164:
DeprecationWarning:
The 'warn' method is deprecated, use 'warning' instead
"The module torch.distributed.launch is deprecated "
The module torch.distributed.launch is deprecated and
going to be removed in future.Migrate to torch.distributed.run
*****************************************
Setting OMP_NUM_THREADS environment variable for each process
to be 1 in default, to avoid your system being overloaded, please
further tune the variable for optimal performance in your application
as needed.
*****************************************
WARNING:torch.distributed.run:--use_env is deprecated and will be
removed in future releases.
Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : multi_g2.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 2
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:x
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log
directory set to: /tmp/torchelastic_db1cckbb/none_y8sql577
INFO:torch.distributed.elastic.agent.server.api:
[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:
[default] Rendezvous'ing worker group
xxx/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:53:
FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:
[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=xxx
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]
INFO:torch.distributed.elastic.agent.server.api
:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:
Setting worker0 reply file to:
/tmp/torchelastic_db1cckbb/none_y8sql577/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:
Setting worker1 reply file to:
/tmp/torchelastic_db1cckbb/none_y8sql577/attempt_0/1/error.json
INFO:torch.distributed.elastic.agent.server.api:
[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:
Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
xxx/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:71:
FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0006513595581054688 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "21051", "role": "default", "hostname": "xxx", "state": "SUCCEEDED", "total_run_time": 10, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [2]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "21053", "role": "default", "hostname": "xxx", "state": "SUCCEEDED", "total_run_time": 10, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [2]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "xxx", "state": "SUCCEEDED", "total_run_time": 10, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}
Now, the output with torch.distributed.run when using ./run.sh 0,1; the job hangs:
$ ./run.sh 0,1
xxx/lib/python3.7/site-packages/torch/distributed/launch.py
[INFO] 2021-09-09 17:06:24,426 run: Running torch.distributed.run with args:
['xxx/lib/python3.7/site-packages/torch/distributed/run.py',
'--nnodes=1', '--node_rank=0', '--nproc_per_node=2', 'multig.py', '--local_world_size=2']
[INFO] 2021-09-09 17:06:24,427 run: Using nproc_per_node=2.
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your
system being overloaded, please further tune the variable for optimal performance in your
application as needed.
*****************************************
[INFO] 2021-09-09 17:06:24,427 api: Starting elastic_operator with launch configs:
entrypoint : multig.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 2
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:xxx
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
[INFO] 2021-09-09 17:06:24,429 local_elastic_agent: log directory set to:
/tmp/torchelastic_ji82kwgj/none_v3x8ezxd
[INFO] 2021-09-09 17:06:24,429 api: [default] starting workers for entrypoint: python
[INFO] 2021-09-09 17:06:24,429 api: [default] Rendezvous'ing worker group
[INFO] 2021-09-09 17:06:24,429 static_tcp_rendezvous: Creating TCPStore as the c10d::Store
implementation
xxx/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:53:
FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
[INFO] 2021-09-09 17:06:24,433 api: [default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=same_as_above
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]
[INFO] 2021-09-09 17:06:24,433 api: [default] Starting worker group
[INFO] 2021-09-09 17:06:24,434 __init__: Setting worker0 reply file to:
/tmp/torchelastic_ji82kwgj/none_v3x8ezxd/attempt_0/0/error.json
[INFO] 2021-09-09 17:06:24,434 __init__: Setting worker1 reply file to:
/tmp/torchelastic_ji82kwgj/none_v3x8ezxd/attempt_0/1/error.json
Traceback (most recent call last):
File "multig.py", line 60, in <module>
spmd_main(args.local_world_size, args.local_rank)
File "multig.py", line 28, in spmd_main
demo_basic(local_world_size, local_rank)
File "multig.py", line 43, in demo_basic
Traceback (most recent call last):
File "multig.py", line 60, in <module>
ddp_model = DDP(model, device_ids)
File
"xxx/lib/python3.7/site-packages/torch/nn/parallel/distributed.py",
line 496, in __init__
dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in:
/opt/conda/conda-bld/pytorch_1623448265233/work/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid
usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops,
too many collectives at once, mixing streams in a group, etc).
spmd_main(args.local_world_size, args.local_rank)
File "multig.py", line 28, in spmd_main
demo_basic(local_world_size, local_rank)
File "multig.py", line 43, in demo_basic
ddp_model = DDP(model, device_ids)
File
"xxx/lib/python3.7/site-packages/torch/nn/parallel/distributed.py",
line 496, in __init__
dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in:
/opt/conda/conda-bld/pytorch_1623448265233/work/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid
usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops,
too many collectives at once, mixing streams in a group, etc).
[ERROR] 2021-09-09 17:06:34,499 api: failed (exitcode: 1) local_rank: 0 (pid: 26925) of binary:
xxx/bin/python
[ERROR] 2021-09-09 17:06:34,499 local_elastic_agent: [default] Worker group failed
[INFO] 2021-09-09 17:06:34,499 api: [default] Worker group FAILED. 3/3 attempts left; will restart
worker group
[INFO] 2021-09-09 17:06:34,499 api: [default] Stopping worker group
[INFO] 2021-09-09 17:06:34,500 api: [default] Rendezvous'ing worker group
[INFO] 2021-09-09 17:06:34,500 static_tcp_rendezvous: Creating TCPStore as the c10d::Store
implementation
[INFO] 2021-09-09 17:06:34,501 api: [default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=xx
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]
[INFO] 2021-09-09 17:06:34,501 api: [default] Starting worker group
[INFO] 2021-09-09 17:06:34,502 __init__: Setting worker0 reply file to:
/tmp/torchelastic_ji82kwgj/none_v3x8ezxd/attempt_1/0/error.json
[INFO] 2021-09-09 17:06:34,503 __init__: Setting worker1 reply file to:
/tmp/torchelastic_ji82kwgj/none_v3x8ezxd/attempt_1/1/error.json
<HANGS INDEFINITELY>
Typing Ctrl+C gives:
^CTraceback (most recent call last):
File "multig.py", line 60, in <module>
spmd_main(args.local_world_size, args.local_rank)
File "multig.py", line 27, in spmd_main
dist.init_process_group(backend="nccl")
File
"xxx/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py",
line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File
"xxx/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py",
line 207, in _store_based_barrier
time.sleep(0.01)
KeyboardInterrupt
Traceback (most recent call last):
File "xxx/lib/python3.7/runpy.py", line
193, in _run_module_as_main
"__main__", mod_spec)
File "xxx/lib/python3.7/runpy.py", line
85, in _run_code
exec(code, run_globals)
File
"xxx/lib/python3.7/site-packages/torch/distributed/run.py",
line 637, in <module>
main()
File
"xxx/lib/python3.7/site-packages/torch/distributed/run.py",
line 629, in main
run(args)
File
"xxx/lib/python3.7/site-packages/torch/distributed/run.py",
line 624, in run
)(*cmd_args)
File
"xxx/lib/python3.7/site-packages/torch/distributed/launcher/api.py",
line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File
"xxx/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py",
line 348, in wrapper
return f(*args, **kwargs)
File
"xxx/lib/python3.7/site-packages/torch/distributed/launcher/api.py",
line 238, in launch_agent
result = agent.run()
File
"xxx/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py",
line 125, in wrapper
result = f(*args, **kwargs)
File
"xxx/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py",
line 700, in run
result = self._invoke_run(role)
File
"xxx/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py",
line 828, in _invoke_run
time.sleep(monitor_interval)
KeyboardInterrupt
Environment:
- Info collected using https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py:
$ python collect_env.py
Collecting environment information...
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.10
Python version: 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-4.15.0-122-generic-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration:
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB
Nvidia driver version: 455.32.00
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.7.0
[pip3] numpy==1.20.1
[pip3] torch==1.9.0
[pip3] torchvision==0.10.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] efficientnet-pytorch 0.7.0 pypi_0 pypi
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.3.0 h06a4308_520
[conda] numpy 1.20.1 pypi_0 pypi
[conda] pytorch 1.9.0 py3.7_cuda11.1_cudnn8.0.5_0 pytorch
[conda] torchvision 0.10.0 py37_cu111 pytorch
Conda virtual environment:
$ pip freeze
appdirs==1.4.4
attrs==20.3.0
backcall==0.1.0
bleach==2.1.3
certifi==2021.5.30
chardet==3.0.4
compress-pickle==1.1.0
cycler==0.10.0
Cython==0.29.2
decorator==4.3.2
efficientnet-pytorch==0.7.0
entrypoints==0.2.3
future==0.16.0
html5lib==1.0.1
idna==2.10
imageio==2.4.1
importlib-metadata==3.5.0
iniconfig==1.1.1
ipykernel==4.8.2
ipython==6.5.0
ipython-genutils==0.2.0
ipywidgets==7.4.2
jedi==0.12.1
Jinja2==2.10
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.4
jupyter-console==5.2.0
jupyter-core==4.4.0
kiwisolver==1.0.1
MarkupSafe==1.1.1
matplotlib==3.0.2
mistune==0.8.4
mock==4.0.3
more-itertools==8.8.0
munch==2.5.0
nbconvert==5.3.1
nbformat==4.4.0
networkx==2.5
notebook==5.7.4
numpy==1.20.1
olefile==0.46
opencv-python==4.1.2.30
packaging==20.9
pandocfilters==1.4.2
parso==0.3.1
pexpect==4.6.0
pickleshare==0.7.5
Pillow @ file:///tmp/build/80754af9/pillow_1625655818400/work (8.3.1)
pluggy==0.13.1
pretrainedmodels==0.7.4
prometheus-client==0.3.1
prompt-toolkit==1.0.15
protobuf==3.7.1
ptyprocess==0.6.0
py==1.10.0
pygifsicle==1.0.1
Pygments==2.3.1
pyparsing==2.3.1
pytest==6.2.2
python-dateutil==2.8.0
PyWavelets==1.1.1
PyYAML==3.13
pyzmq==17.1.2
qtconsole==4.3.1
requests==2.24.0
scikit-image==0.17.2
scikit-learn==0.20.2
scipy==1.2.1
Send2Trash==1.5.0
simplegeneric==0.8.1
six==1.12.0
terminado==0.8.1
testpath==0.3.1
texttable==1.6.2
tifffile==2020.10.1
timm==0.4.12
toml==0.10.2
torch==1.9.0
torchvision==0.10.0
tornado==5.1.1
tqdm==4.31.1
traitlets==4.3.2
typing-extensions @ file:///tmp/build/80754af9/typing_extensions_1624965014186/work (3.10.0.0)
urllib3==1.25.10
wcwidth==0.1.7
webencodings==0.5.1
widgetsnbextension==3.4.2
zipp==3.4.0
I removed some packages from the list, as they require compilation.
Creation of the virtual environment and installation of PyTorch:
conda create -n env_test_issue_ddp python=3.7
conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c nvidia
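For completeness, a quick way to double-check the versions from inside the environment; I have not pasted its output here, and torch.cuda.nccl.version() reporting the NCCL build bundled with the conda package is my understanding rather than something verified on this machine:

import torch

# Versions of the installed PyTorch build and the CUDA / NCCL it was compiled against.
print(torch.__version__)          # 1.9.0 according to collect_env.py above
print(torch.version.cuda)         # 11.1 according to collect_env.py above
print(torch.cuda.nccl.version())  # should correspond to the NCCL 2.7.8 mentioned in the error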
Let me know if you need more info.
Thanks.