Random CUDA error at dist.barrier() after initialization, before model creation

The program first calls dist.init_process_group() and then synchronizes the workers with dist.barrier(). After that, it creates the data loader and the model and starts training. However, it crashes at dist.barrier() with a random error message. Here are a few of the errors. In most cases the job runs fine, but this happens now and then.
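For readers following along, the startup sequence described above might look roughly like the sketch below. The original script was not posted, so this is an assumption: it presumes a launcher such as torchrun that exports RANK, WORLD_SIZE and LOCAL_RANK, and `read_dist_env` is a hypothetical helper name.

```python
import os


def read_dist_env(env=os.environ):
    """Read the rank settings that launchers such as torchrun export."""
    return {
        "rank": int(env.get("RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
    }


def main():
    # Imported lazily so read_dist_env() stays usable without PyTorch.
    import torch
    import torch.distributed as dist

    cfg = read_dist_env()
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        # Bind this process to its own GPU before any collective runs.
        torch.cuda.set_device(cfg["local_rank"])

    # env:// reads MASTER_ADDR and MASTER_PORT; these defaults are an
    # assumption that only makes sense for a single-machine run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(
        backend="nccl" if use_cuda else "gloo",
        init_method="env://",
        rank=cfg["rank"],
        world_size=cfg["world_size"],
    )

    dist.barrier()  # the call where the intermittent crash was reported

    # ... create the data loader and the model, then train ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Calling torch.cuda.set_device() before the first collective is a common precaution with the NCCL backend, not something the original post confirms it did.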

dist.barrier() synchronizes the processes, so it could re-raise a valid exception that actually comes from earlier work.
For the 3 different errors:

  • Reduce the model or batch size to avoid the OOM.
  • The NCCL error might be a red herring and could in fact also be caused by the OOM issue.
  • Could you post the stack trace here, please? Also, are you using custom CUDA code?
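One possible way to collect a more useful stack trace is sketched below. It is not taken from the thread: `enable_debug_env` and `call_with_trace` are hypothetical helper names, and the only real settings used are the CUDA_LAUNCH_BLOCKING and NCCL_DEBUG environment variables, which need to be set before CUDA or NCCL is initialized.

```python
import os
import sys
import traceback


def enable_debug_env(env=os.environ):
    """Set debugging variables without overriding values already present."""
    # Makes CUDA kernel launches synchronous, so an error is reported
    # at the operation that caused it rather than at a later sync point.
    env.setdefault("CUDA_LAUNCH_BLOCKING", "1")
    # Makes NCCL print its own initialization and error details.
    env.setdefault("NCCL_DEBUG", "INFO")
    return env


def call_with_trace(fn, rank, stream=sys.stderr):
    """Run fn(); on failure write the rank and full stack trace, then re-raise."""
    try:
        return fn()
    except Exception:
        name = getattr(fn, "__name__", repr(fn))
        stream.write(f"[rank {rank}] {name} failed:\n")
        stream.write(traceback.format_exc())
        stream.flush()
        raise
```

In a training script this could be used as `enable_debug_env()` at the very top, and later `call_with_trace(dist.barrier, dist.get_rank())` in place of the bare barrier call. Since CUDA_LAUNCH_BLOCKING slows execution down, it is probably best kept for debugging runs only.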

PS: you can post code snippets by wrapping them in three backticks ```, which makes debugging easier.

Thanks. In the end, it turned out to be a hardware issue.