If I use the file_system sharing strategy in multiprocessing together with distributed training, my NCCL collective calls get stuck and eventually time out. This happens randomly, at a different point in every training session. Is there any known issue that would make the file_system strategy fail during distributed training?
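For reference, this is roughly how I switch the strategy (a minimal sketch of the setup, not my full training script):

```python
import torch.multiprocessing as mp

# Switch tensor sharing from the default file_descriptor strategy to
# file_system, which backs shared tensors with files in shared memory
# instead of keeping file descriptors open.
mp.set_sharing_strategy("file_system")
print(mp.get_sharing_strategy())  # file_system
```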
My first guess is that this is probably not related to the file_system strategy, since it is only used at initialization time and there is no dependency on it after that. I would suggest running your code with the environment variables NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL and sharing the output. That would give a lot more insight into what might be going on here.
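For example, if you launch from a Python entry point, you can set them before the process group is initialized (setting them in the launching shell via `export` works just as well):

```python
import os

# Enable verbose NCCL and torch.distributed logging. These must be set
# before the training process initializes its process group, otherwise
# they have no effect on the NCCL communicators.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
```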
Thank you @pritamdamania87. With the default strategy the collectives work fine, but training eventually fails with too many open file descriptors; with file_system it fails during the collective calls. I will try to debug further, although distributed debugging can be pretty difficult.
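For the record, the file-descriptor failure with the default strategy can at least be postponed by raising the per-process soft fd limit toward the hard limit (a rough sketch, assuming a Unix system where this is permitted):

```python
import resource

# With the default file_descriptor sharing strategy, each shared tensor
# holds an open fd, so long-running workers can exhaust the soft limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft:", soft, "hard:", hard)

# An unprivileged process may raise its soft limit up to the hard limit.
if hard != resource.RLIM_INFINITY:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```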