Hello,
I would like to ask a question please.
The question concerns a strange error that I encountered when using fairseq, but I believe the issue concerns PyTorch in general.
After upgrading to PyTorch 1.7, my training runs frequently failed with a memory error:
File "/home/user/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 943, in all_reduce
work = group.allreduce([tensor], opts)
MemoryError: std::bad_alloc
This issue happened randomly: with exactly the same configuration (same model, training data, and everything else), some runs completed normally while others failed with a memory error or a socket timeout error.
After downgrading back to 1.6, I no longer observed this issue. However, I was not able to resume trainings that I had started on 1.7 because of the following error:
File "/home/user/code/fairseq/fairseq/optim/adam.py", line 199, in step
self.optimizer.step(closure)
File "/home/user/code/fairseq/fairseq/optim/adam.py", line 199, in step
state["step"] += 1
KeyError: 'step'
(The error happened at this line in the code.)
I manually checked the checkpoints saved with 1.7 and found that their optimizer states indeed lack the key 'step', while the 1.6 checkpoints have it.
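A check of this kind can be sketched as follows. This is only an illustration, not the exact script used: it assumes the optimizer state follows the usual torch.optim state_dict layout (a "state" dict keyed by parameter id plus "param_groups"), and that the fairseq checkpoint stores it under the key "last_optimizer_state". The helper names and the toy data are hypothetical.

```python
def params_missing_step(optimizer_state):
    """Return the ids of parameters whose optimizer state has no 'step' entry."""
    per_param_state = optimizer_state.get("state", {})
    return [pid for pid, entry in per_param_state.items() if "step" not in entry]


def load_optimizer_state(path, key="last_optimizer_state"):
    """Load a checkpoint on CPU and return its optimizer state dict.

    The default key is an assumption about the fairseq checkpoint layout.
    """
    import torch  # imported lazily so the helper above also works without PyTorch

    checkpoint = torch.load(path, map_location="cpu")
    return checkpoint[key]


# Toy optimizer state in the torch.optim state_dict layout (made-up values):
# parameter 0 has a 'step' entry, parameter 1 does not.
toy_state = {
    "state": {
        0: {"step": 10, "exp_avg": 0.0, "exp_avg_sq": 0.0},
        1: {"exp_avg": 0.0, "exp_avg_sq": 0.0},
    },
    "param_groups": [{"lr": 1e-3, "params": [0, 1]}],
}

missing = params_missing_step(toy_state)
print("parameters without 'step':", missing)
```

On a real checkpoint, the same check would be params_missing_step(load_optimizer_state(path)), with path pointing to the checkpoint file.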
Could you please tell me how I can continue my trainings on PyTorch 1.6 by resuming from the checkpoints saved by PyTorch 1.7?
Thank you so much in advance for your help!