Error when executing loss.backward() in PyTorch with distributed training

Traceback (most recent call last):
  File "./pretraining/run_pretraining.py", line 440, in <module>
    main()
  File "./pretraining/run_pretraining.py", line 384, in main
    loss.backward()
  File "/data/anaconda/envs/bzheng_env/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/data/anaconda/envs/bzheng_env/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/data/anaconda/envs/bzheng_env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/distributed.py", line 310, in overlapping_backward_epilogue
    "This probably indicates some buckets were not allreduced.")
RuntimeError: ('In epilogue, next_bucket (0) != num_buckets (1).  ', 'This probably indicates some buckets were not allreduced.')

This error occurred when executing loss.backward() in PyTorch with distributed training.

It occurs even when using only a single GPU, while the program runs normally without distributed training.

Has anyone met the same problem?

The command I use to start the program is as follows:

python -u -m torch.distributed.launch --nproc_per_node=1 ./pretraining/run_pretraining.py ******

Do you wrap your model with DistributedDataParallel in your program?
You need to change some of your code when using torch.distributed.launch, as in the sketch below.
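
Here is a minimal sketch (not your actual training script) of the changes torch.distributed.launch typically expects: parsing the --local_rank argument the launcher injects, initializing the process group, and wrapping the model in DistributedDataParallel. The tiny Linear model is only a placeholder for the real pretraining model.

import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torch.distributed.launch passes --local_rank to each worker process.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# Bind this process to its GPU and join the process group set up by the launcher.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# Placeholder model; replace with the real model from run_pretraining.py.
model = torch.nn.Linear(768, 768).cuda()
model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)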

@11116 I see you’re using Apex. The error message in the epilogue means that not all learnable parameters in your model had their gradients computed (i.e. they did not participate in the forward pass). This can happen if your forward pass uses control flow that skips certain parameters. You can fix this in Apex by delaying the allreduce until the very end of the backward pass via the delay_allreduce option, as in the sketch below. See https://nvidia.github.io/apex/parallel.html for more details.
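
A minimal sketch of that fix, using a placeholder Linear model in place of the real pretraining model (delay_allreduce is the documented option of apex.parallel.DistributedDataParallel):

import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as ApexDDP

# Join the process group set up by torch.distributed.launch.
dist.init_process_group(backend="nccl", init_method="env://")

# Placeholder model; replace with the real model from run_pretraining.py.
model = torch.nn.Linear(768, 768).cuda()

# delay_allreduce=True postpones all gradient allreduces until the end of
# backward(), so parameters skipped by control flow in the forward pass no
# longer trip the "next_bucket != num_buckets" check in the epilogue.
model = ApexDDP(model, delay_allreduce=True)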