I’m currently using DDP training on a large dataset. During evaluation, I only test the rank-0 model for simplicity; one eval epoch takes ~40 min, so I call dist.barrier() in the other processes to block them. However, since PyTorch DDP has a default timeout of 30 min, training crashes every time during the eval epoch.
According to the documentation, I can pass a larger timeout when initializing the process group. I’m confused by the statement that this is applicable only if the environment variable NCCL_BLOCKING_WAIT or NCCL_ASYNC_ERROR_HANDLING is set to 1. Does that mean that in the code I should write something like os.environ['NCCL_ASYNC_ERROR_HANDLING'] = '1' (a string, since environment variables can’t be assigned an int) in order to use a larger timeout value? And which of NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING is suitable for my use case? Thanks!