Traceback (most recent call last):
  File "./pretraining/run_pretraining.py", line 440, in <module>
    main()
  File "./pretraining/run_pretraining.py", line 384, in main
    loss.backward()
  File "/data/anaconda/envs/bzheng_env/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/data/anaconda/envs/bzheng_env/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
  File "/data/anaconda/envs/bzheng_env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/distributed.py", line 310, in overlapping_backward_epilogue
    "This probably indicates some buckets were not allreduced.")
RuntimeError: ('In epilogue, next_bucket (0) != num_buckets (1). ', 'This probably indicates some buckets were not allreduced.')
This error occurs when executing loss.backward() in PyTorch with distributed training.
It happens even when using only a single GPU, while the same program runs normally without distributed training.
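My guess, from the message raised in overlapping_backward_epilogue, is that apex's DistributedDataParallel expects every gradient bucket to be allreduced during the overlapped backward pass, and this error fires when at least one bucket is skipped, for example when some parameters never receive a gradient. The following is only a minimal sketch of that suspected failure mode, not my actual script; ToyModel and its layers are placeholders:

import argparse

import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

# torch.distributed.launch passes --local_rank and sets the env:// rendezvous variables
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(args.local_rank)

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(8, 8)
        self.unused = torch.nn.Linear(8, 8)  # parameters that never receive gradients

    def forward(self, x):
        return self.used(x)  # self.unused is skipped, so its bucket would never be allreduced

model = DDP(ToyModel().cuda())
loss = model(torch.randn(4, 8).cuda()).sum()
loss.backward()  # suspected to trigger: next_bucket != num_buckets

If that is indeed the mechanism, apex's DistributedDataParallel also accepts delay_allreduce=True, which defers all allreduces to the end of backward instead of overlapping them, but I have not verified whether that avoids this error in my case.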
Has anyone run into the same problem?
The command I use to start the program is as follows:
python -u -m torch.distributed.launch --nproc_per_node=1 ./pretraining/run_pretraining.py ******