If I use the file_system sharing strategy in multiprocessing together with distributed training, my NCCL collective calls get stuck and eventually time out. This happens randomly, at a different point in every training session. Is there any known issue that would make the file_system strategy fail during distributed training?
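For reference, this is roughly how I switch the strategy (a minimal sketch of the setup, not my full training script):

```python
import torch.multiprocessing as mp

# Switch tensor sharing from the default file_descriptor strategy to
# file_system, which backs shared tensors with files in shared memory
# instead of keeping file descriptors open.
mp.set_sharing_strategy("file_system")
print(mp.get_sharing_strategy())  # file_system
```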
My first guess is that this is probably not related to the file_system strategy, since it is only used at initialization time and there is no dependency on it after that. I would suggest running your code with the environment variables NCCL_DEBUG=INFO and TORCH_DISTRIBUTED_DEBUG=DETAIL and sharing the output. That would give a lot more insight into what might be going on here.
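For example, if you launch from a Python entry point, you can set them before the process group is initialized (setting them in the launching shell via `export` works just as well):

```python
import os

# Enable verbose NCCL and torch.distributed logging. These must be set
# before the training process initializes its process group, otherwise
# they have no effect on the NCCL communicators.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
```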
Thank you @pritamdamania87. With the default strategy the collectives work fine, but training eventually fails with too many open file descriptors; with file_system it fails during the collective calls. I will try to debug further, although distributed debugging can be pretty difficult.
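For the record, the file-descriptor failure with the default strategy can at least be postponed by raising the per-process soft fd limit toward the hard limit (a rough sketch, assuming a Unix system where this is permitted):

```python
import resource

# With the default file_descriptor sharing strategy, each shared tensor
# holds an open fd, so long-running workers can exhaust the soft limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft:", soft, "hard:", hard)

# An unprivileged process may raise its soft limit up to the hard limit.
if hard != resource.RLIM_INFINITY:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```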