How to set longer timeout for DDP training?

I’m currently using DDP training on a large dataset. During evaluation, I only run the rank 0 model for simplicity. One eval epoch takes ~40 min, and I call dist.barrier() in the other processes to block them while rank 0 evaluates. However, since PyTorch DDP has a default timeout of 30 min, training crashes every time it reaches the eval epoch.
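Roughly, my eval step looks like this (a simplified sketch; `evaluate()` and `model` are placeholders for my actual eval loop and model):

```python
import torch.distributed as dist

# Only rank 0 runs evaluation; the other ranks wait at the barrier.
if dist.get_rank() == 0:
    evaluate(model)  # takes ~40 min, longer than the default 30 min timeout
dist.barrier()       # non-zero ranks block here until rank 0 reaches this point
```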

According to the documentation, I can pass a larger timeout when initializing the process group. I’m confused by the statement that this is applicable only if the environment variable NCCL_BLOCKING_WAIT or NCCL_ASYNC_ERROR_HANDLING is set to 1. Does that mean that in the code I should write something like os.environ['NCCL_ASYNC_ERROR_HANDLING'] = '1' in order to use a larger timeout value? And which of NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING is suitable for my use case? Thanks!

@rvarm1 @Yanli_Zhao ?

OK, so I tried setting timeout=timedelta(xxx) in init_process_group() and that seems to work. Maybe we don’t need to manually set these environment variables.
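For anyone else hitting this, here is a minimal sketch of what I mean (the NCCL backend and the 2-hour value are just example choices, pick whatever exceeds your eval time):

```python
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=2),  # must be longer than the ~40 min eval epoch
)
```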

Thank you very much! I had the same problem and this solved it.