Hello,
I would like to ask about a strange error I encountered while using fairseq, though I believe the underlying issue concerns PyTorch in general.
After upgrading to PyTorch 1.7, my training runs frequently crashed with a memory error:
File "/home/user/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 943, in all_reduce
    work = group.allreduce([tensor], opts)
MemoryError: std::bad_alloc
(The output above is interleaved because several processes crashed at once.) The issue happened randomly: with exactly the same configuration (same model, training data, and everything else), some runs completed fine while others failed with a memory error or a socket timeout.
After downgrading back to 1.6, I no longer observed this issue. However, I was not able to resume training runs that I had started on 1.7, because of the following error:
File "/home/user/code/fairseq/fairseq/optim/adam.py", line 199, in step
    self.optimizer.step(closure)
...
    state["step"] += 1
KeyError: 'step'
(Again, the traceback is interleaved across processes; the error is raised at the state["step"] += 1 line.)
I manually inspected the checkpoints saved by 1.7 and found that their optimizer states indeed lack the "step" key, while the 1.6 checkpoints have it.
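One workaround I have considered is patching the checkpoints before resuming. This is only a sketch, assuming the optimizer state in the checkpoint follows PyTorch's usual state_dict layout (a "state" dict of per-parameter entries plus "param_groups"); add_missing_step is a helper name I made up, and I am unsure what value of default_step would be safe to use, which is part of why I am asking:

```python
# Hypothetical helper (my own sketch, not fairseq/PyTorch API): insert a
# missing "step" entry into each per-parameter state of a PyTorch-style
# optimizer state dict. default_step is a placeholder; ideally it would be
# the real update count recorded elsewhere in the checkpoint.
def add_missing_step(optimizer_state, default_step=0):
    for param_state in optimizer_state.get("state", {}).values():
        param_state.setdefault("step", default_step)
    return optimizer_state

# Toy example mimicking the structure of optimizer.state_dict():
ckpt_opt = {
    "state": {
        0: {"exp_avg": [0.1], "exp_avg_sq": [0.01]},            # no "step" (as saved by 1.7)
        1: {"exp_avg": [0.2], "exp_avg_sq": [0.04], "step": 5}, # already has "step"
    },
    "param_groups": [{"lr": 1e-4, "params": [0, 1]}],
}
add_missing_step(ckpt_opt, default_step=5)
print(ckpt_opt["state"][0]["step"])  # → 5
```

In the real case I would load the checkpoint with torch.load, patch the optimizer state like this, and save it back with torch.save, but I do not know whether a manually inserted "step" would make Adam's bias correction incorrect on resume.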
Could you please tell me how I can continue my training on PyTorch 1.6 by resuming from checkpoints saved by PyTorch 1.7?
Thank you so much in advance for your help!