I’m currently using DDP training on a large dataset. During evaluation, I only run the rank-0 model for simplicity. One eval epoch takes ~40 min, and I call
dist.barrier() in the other processes to block them. However, since PyTorch DDP has a default timeout of 30 min, training crashes every time during the eval epoch.
According to the documentation, I can pass a larger
timeout when initializing the process group. I’m confused by the statement that this is applicable only if the environment variable
NCCL_ASYNC_ERROR_HANDLING is set to 1. Does that mean that in my code I should write something like
os.environ['NCCL_ASYNC_ERROR_HANDLING'] = '1' in order to use a larger timeout value?
Also, the docs mention both NCCL_BLOCKING_WAIT and NCCL_ASYNC_ERROR_HANDLING; which one is suitable for my use case? Thanks!
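For concreteness, this is roughly what I have in mind (a single-process gloo sketch just to show where I would set things; my real run uses the nccl backend via a launcher, and I’m not sure this is the right way to do it):

```python
import os
from datetime import timedelta

import torch.distributed as dist

# My assumption: the env var must be set as a *string*, and before
# init_process_group is called (not as an int, as in my original snippet).
os.environ['NCCL_ASYNC_ERROR_HANDLING'] = '1'

# Single-process rendezvous just for this sketch; in real training these
# come from the launcher (e.g. torchrun).
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')

dist.init_process_group(
    backend='gloo',                  # 'nccl' in my real run
    rank=0,
    world_size=1,
    timeout=timedelta(minutes=60),   # longer than my ~40 min eval epoch
)

print(dist.get_world_size())
dist.destroy_process_group()
```

The idea is that with the timeout raised above the eval-epoch duration, the dist.barrier() in the non-zero ranks should no longer trip the 30-min default while rank 0 evaluates.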