Error when resuming training: No 'step' key in optimizer state (PyTorch 1.7)

miel · January 8, 2021, 6:00pm

Hello,

I would like to ask a question please.

The question concerns a strange error that I encountered when using fairseq, but I believe the issue concerns PyTorch in general.

After upgrading to PyTorch 1.7, my training runs frequently failed with a memory error:

  File "/home/user/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 943, in all_reduce
    work = group.allreduce([tensor], opts)
MemoryError: std::bad_alloc

This issue happened randomly. With exactly the same configuration (same model, training data, and everything else), some runs completed without problems while others failed with a memory error or a socket timeout error.

After downgrading back to 1.6, I no longer observed this issue. However, I was not able to resume trainings that I had started on 1.7 because of the following error:

  File "/home/user/code/fairseq/fairseq/optim/adam.py", line 199, in step
    self.optimizer.step(closure)
  File "/home/user/code/fairseq/fairseq/optim/adam.py", line 199, in step
    state["step"] += 1
KeyError: 'step'

(The error happened at this line in the code.)

I have manually checked the checkpoints saved by 1.7 and found that their optimizer states indeed do not contain the 'step' key, while the 1.6 checkpoints do.
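A check of this kind can be sketched roughly as follows. This is only an illustration: the helper name params_missing_key is made up, the dictionaries are stand-ins shaped like the output of optimizer.state_dict(), and the checkpoint file name and the "last_optimizer_state" key in the commented part are assumptions about how the checkpoint is laid out.

```python
def params_missing_key(optimizer_state, key="step"):
    """Return ids of parameters whose per-parameter state has no `key` entry.

    `optimizer_state` is expected to look like optimizer.state_dict():
    {"state": {param_id: {...}}, "param_groups": [...]}.
    """
    per_param = optimizer_state.get("state", {})
    return [pid for pid, entry in per_param.items() if key not in entry]


# Stand-in dictionaries (real entries would hold tensors, not floats).
state_with_step = {"state": {0: {"step": 1200, "exp_avg": 0.0, "exp_avg_sq": 0.0}}}
state_without_step = {"state": {0: {"exp_avg": 0.0, "exp_avg_sq": 0.0}}}

print(params_missing_key(state_with_step))
print(params_missing_key(state_without_step))

# With a real checkpoint (file name and top-level key are assumptions):
# import torch
# ckpt = torch.load("checkpoint_last.pt", map_location="cpu")
# print(params_missing_key(ckpt["last_optimizer_state"]))
```

An empty list means every parameter state already carries the key.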

Could you please tell me how I can continue my trainings on PyTorch 1.6 by resuming from the checkpoints saved by PyTorch 1.7?

Thank you so much in advance for your help!

ptrblck · January 18, 2021, 3:18am

You could most likely add the missing step state manually to the optimizer.

However, I would generally recommend sticking to the latest version, so we would need to look into the memory issue. Do you have a code snippet to reproduce this issue, and could you post your current setup (PyTorch, CUDA, cuDNN, and NCCL versions, etc.)?
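For the manual fix mentioned above, a minimal sketch could look like the following. It is not a tested recipe for fairseq: the helper name add_missing_step is hypothetical, the dictionaries are stand-ins shaped like optimizer.state_dict(), and the file names and the "last_optimizer_state" key in the commented part are assumptions. The step value is not stored anywhere in this sketch, so you have to supply the number of updates already performed yourself.

```python
def add_missing_step(optimizer_state, num_updates):
    """Fill in "step" for every per-parameter state that lacks it.

    Modifies `optimizer_state` in place and returns how many entries
    were patched. Entries that already have "step" are left untouched.
    """
    patched = 0
    for entry in optimizer_state.get("state", {}).values():
        if "step" not in entry:
            entry["step"] = num_updates
            patched += 1
    return patched


# Stand-in for an optimizer state dict (real entries would hold tensors).
old_state = {
    "state": {
        0: {"exp_avg": 0.0, "exp_avg_sq": 0.0},             # no "step"
        1: {"step": 7, "exp_avg": 0.0, "exp_avg_sq": 0.0},  # already has "step"
    },
    "param_groups": [{"lr": 1e-3, "params": [0, 1]}],
}

num_patched = add_missing_step(old_state, num_updates=5000)
print(num_patched, old_state["state"][0]["step"], old_state["state"][1]["step"])

# With a real checkpoint (file names and top-level key are assumptions):
# import torch
# ckpt = torch.load("checkpoint_last.pt", map_location="cpu")
# add_missing_step(ckpt["last_optimizer_state"], num_updates=5000)
# torch.save(ckpt, "checkpoint_last_patched.pt")
```

Adam uses the step count for its bias correction, so the value should be as close as possible to the real number of updates. After many thousands of updates the correction factor is very close to 1, so a slightly inexact value should matter little. Keep a copy of the original checkpoint before overwriting anything.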

miel · October 6, 2021, 1:01pm

Thank you for your reply! I used PyTorch 1.7 to resume the trainings that I needed and just used the latest version for all later trainings.