Hi, I am trying to train my LLM on a single-node, multi-GPU system. When I run the DDP code with a dataset of 20M data points, it works fine as expected, but as soon as I increase the dataset to 200M data points in the same code, I get the following error:
```
Traceback (most recent call last):
  File "/home/ddp/multigpu.py", line 374, in <module>
    mp.spawn(main, args=(world_size, df_train, df_val, df_test,bs,model,maxlen,lr,epochs,patience), nprocs=world_size)
  File "/home/anaconda3/envs/py_gpu/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/anaconda3/envs/py_gpu/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/anaconda3/envs/py_gpu/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/anaconda3/envs/py_gpu/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/ddp/multigpu.py", line 331, in main
    ddp_setup(rank, world_size)
  File "/home/ddp/multigpu.py", line 35, in ddp_setup
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
  File "/home/anaconda3/envs/py_gpu/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 920, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/anaconda3/envs/py_gpu/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 459, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=8, worker_count=6, timeout=0:30:00)
```
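For context, here is roughly what the relevant part of multigpu.py looks like, trimmed down to the call path in the traceback above (the rendezvous env-var values are placeholders, and the data loading / training loop is omitted):

```python
import os

import torch.multiprocessing as mp
from torch.distributed import init_process_group, destroy_process_group


def ddp_setup(rank, world_size):
    # Single-node rendezvous config (placeholder values)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # This is the call (line 35) that times out in the store-based barrier
    init_process_group(backend="nccl", rank=rank, world_size=world_size)


def main(rank, world_size, df_train, df_val, df_test, bs, model, maxlen, lr, epochs, patience):
    ddp_setup(rank, world_size)
    # ... build DataLoaders from the dataframes, wrap model in DDP, train ...
    destroy_process_group()


if __name__ == "__main__":
    world_size = 8  # number of GPUs on the node
    # ... load df_train/df_val/df_test (the full dataset), build model, set hyperparameters ...
    # The dataframes are passed as spawn args, so every worker process receives them
    mp.spawn(main,
             args=(world_size, df_train, df_val, df_test, bs, model, maxlen, lr, epochs, patience),
             nprocs=world_size)
```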
What could be the cause of this error? Note that the same code also runs fine on a different node with 3 GPUs.