I’m currently using DDP training on a large dataset. During evaluation, I only test the rank-0 model for simplicity; one eval epoch takes ~40 min, so I call dist.barrier() in the other processes to block them. However, since PyTorch DDP has a default timeout of 30 min, training crashes every time during the eval epoch.
According to the documentation, I can pass a larger timeout when initializing the process group. I’m confused by the statement that this is applicable only if the environment variable NCCL_BLOCKING_WAIT or NCCL_ASYNC_ERROR_HANDLING is set to 1. Does that mean that in the code I should write something like os.environ['NCCL_ASYNC_ERROR_HANDLING'] = '1' (a string, since environment variables can’t be assigned an int) in order to use a larger timeout value? And which of NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING is suitable for my use case? Thanks!