The program first calls dist.init_process_group() and then synchronizes the workers with dist.barrier(). After that, it creates the data loader and the model, and then starts training. However, it crashes at dist.barrier() with a seemingly random error message. Here are a few of the errors. In most cases the job runs fine, but this happens now and then.
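For reference, the startup sequence is roughly the following. This is a minimal sketch, not the actual script: it assumes the job is launched with torchrun (which exports RANK, WORLD_SIZE and LOCAL_RANK) and uses the nccl backend, and the names read_dist_env, setup_and_sync and timeout_s are made up for illustration.

```python
import os
from datetime import timedelta


def read_dist_env(env=os.environ):
    """Read the rank variables exported by the launcher (hypothetical helper)."""
    return {
        "rank": int(env.get("RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
    }


def setup_and_sync(backend="nccl", timeout_s=1800):
    """Initialize the process group, then wait for all workers (hypothetical helper)."""
    # Imported lazily so read_dist_env can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    cfg = read_dist_env()
    if backend == "nccl":
        # Assumption: one GPU per process, selected by LOCAL_RANK.
        torch.cuda.set_device(cfg["local_rank"])

    dist.init_process_group(
        backend=backend,
        init_method="env://",  # reads MASTER_ADDR / MASTER_PORT from the environment
        rank=cfg["rank"],
        world_size=cfg["world_size"],
        timeout=timedelta(seconds=timeout_s),
    )
    dist.barrier()  # the call where the crash is reported
    return cfg
```

Each worker would call setup_and_sync() once at startup, before creating the data loader and the model. Whether the explicit torch.cuda.set_device call or the timeout value has any bearing on the crash is not known; they are included only to make the sketch complete.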