Hello!
I’m experimenting with distributed training using the NVIDIA Megatron-LM project, and I get an error when I run bash scripts/pretrain_gpt2_model_parallel.sh
The traceback looks like this:
Traceback (most recent call last):
  File "pretrain_gpt2.py", line 625, in <module>
    main()
  File "pretrain_gpt2.py", line 569, in main
    args.eod_token = get_train_val_test_data(args)
  File "pretrain_gpt2.py", line 536, in get_train_val_test_data
    group=mpu.get_model_parallel_group())
  File "/home/ubuntu/Env/ml/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 810, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: Broken pipe
(The same traceback is printed three more times in the output.)
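The error is raised by the torch.distributed broadcast call in get_train_val_test_data (pretrain_gpt2.py, line 536), which broadcasts over mpu.get_model_parallel_group().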
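For context, my understanding (an assumption on my part, not something I have confirmed in the Megatron-LM or PyTorch sources) is that "Broken pipe" is the operating system's EPIPE error, which is raised when a process writes to a pipe or socket whose other end has already been closed. If that is right, one of the processes may have exited before the broadcast reached it. The sketch below is plain Python on a POSIX system, has nothing to do with torch.distributed, and uses a hypothetical helper name; it only shows how the same OS-level error arises:

```python
import errno
import os


def write_after_reader_closed() -> int:
    """Write to a pipe whose read end is already closed and return the errno.

    Hypothetical helper for illustration only; it is not part of Megatron-LM
    or PyTorch.
    """
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # simulate the peer process going away
    try:
        os.write(write_fd, b"payload")
    except BrokenPipeError as exc:
        # Python ignores SIGPIPE by default, so EPIPE surfaces as an exception.
        return exc.errno
    finally:
        os.close(write_fd)
    return 0


if __name__ == "__main__":
    code = write_after_reader_closed()
    print("got EPIPE:", code == errno.EPIPE)
```

So the broadcast itself may only be where the failure becomes visible, and the real cause could be earlier in another process.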
Could anybody help me with this issue?