Error while setting up DDP

Hi, I am trying to train my LLM on a single-node, multi-GPU system. When I run the DDP code with a dataset of 20M data points, it works fine as expected, but as soon as I increase the dataset to 200M data points in the same code, I get the following error:

```
Traceback (most recent call last):
  File "/home/ddp/multigpu.py", line 374, in <module>
    mp.spawn(main, args=(world_size, df_train, df_val, df_test,bs,model,maxlen,lr,epochs,patience), nprocs=world_size)                                                                               
  File "/home/anaconda3/envs/py_gpu/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn                                                                             
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/anaconda3/envs/py_gpu/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes                                                                   
    while not context.join():
  File "/home/anaconda3/envs/py_gpu/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join                                                                              
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/anaconda3/envs/py_gpu/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap                                                                              
    fn(i, *args)
  File "/home/ddp/multigpu.py", line 331, in main
    ddp_setup(rank, world_size)
  File "/home/ddp/multigpu.py", line 35, in ddp_setup
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
  File "/home/anaconda3/envs/py_gpu/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 920, in init_process_group                                                         
    _store_based_barrier(rank, store, timeout)
  File "/home/anaconda3/envs/py_gpu/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 459, in _store_based_barrier                                                       
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=8, worker_count=6, timeout=0:30:00)
```

What could be causing this error? Also, the same code works fine when run on a different node with 3 GPUs.
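For reference, `ddp_setup` in multigpu.py is essentially the standard pattern below (the `MASTER_ADDR`/`MASTER_PORT` values are placeholders here, since only the `init_process_group` call appears in the traceback):

```python
import os
from torch.distributed import init_process_group

def ddp_setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"  # placeholder value
    os.environ["MASTER_PORT"] = "12355"      # placeholder value
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
```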

It looks like some workers did not join when initializing the process group: the error reports world_size=8 but worker_count=6, so two ranks never reached the store-based barrier within the 30-minute timeout. Could the increase in your dataset size, and any preprocessing each spawned process performs, be delaying some workers enough to cause this timeout?
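If so, one stopgap while you investigate is to raise the process-group timeout above the default 30 minutes (the `timeout=0:30:00` in your error), and to pin each rank to its own GPU before NCCL initialization. A minimal sketch of the change in `ddp_setup` (the two-hour value is an arbitrary example):

```python
from datetime import timedelta

import torch
from torch.distributed import init_process_group

def ddp_setup(rank, world_size):
    # ... MASTER_ADDR / MASTER_PORT setup as before ...
    torch.cuda.set_device(rank)  # pin this process to its own GPU before NCCL init
    init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        timeout=timedelta(hours=2),  # default is 30 minutes
    )
```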

In preprocessing, I mainly import the dataframe and then split it into train, validation, and test dataframes before calling mp.spawn. Also, the same code works fine when run on a different node with 3 GPUs.
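Roughly, the driver code looks like this (the file path and split logic are simplified placeholders; the `mp.spawn` call is the one from the traceback, and `main`, `bs`, `model`, `maxlen`, `lr`, `epochs`, and `patience` are defined earlier in the script):

```python
import pandas as pd
import torch
import torch.multiprocessing as mp

if __name__ == "__main__":
    df = pd.read_csv("data.csv")  # placeholder path; ~200M rows in the failing run

    # split into train / validation / test before spawning workers
    df_train = df.sample(frac=0.8, random_state=42)
    rest = df.drop(df_train.index)
    df_val = rest.sample(frac=0.5, random_state=42)
    df_test = rest.drop(df_val.index)

    world_size = torch.cuda.device_count()  # 8 on the failing node
    mp.spawn(main, args=(world_size, df_train, df_val, df_test, bs, model,
                         maxlen, lr, epochs, patience), nprocs=world_size)
```

If it matters: each spawned process receives its own pickled copy of these dataframes through `args`.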