Handling out-of-memory errors when using gradient accumulation

Hi, I am trying to handle out-of-memory (OOM) errors during training. I have seen people point to the fairseq example for this, but my situation is different: I am also using gradient accumulation, so simply doing `del p.grad` after an OOM would throw away the gradients already accumulated from previous micro-batches, which wastes data. My questions are: (1) When I catch an OOM during the forward/backward pass, has the gradient from that batch already been added to `p.grad` or not? (2) Alternatively, do you have any suggestions for handling OOM exceptions while using gradient accumulation? Thanks!
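To make the situation concrete, here is a minimal sketch of the kind of loop I mean (the model, optimizer, and `accum_steps` are placeholders, not my real setup; the OOM catch follows the usual catch-`RuntimeError`-and-check-the-message pattern):

```python
import torch
import torch.nn as nn

# Placeholder model/optimizer just to illustrate the accumulation pattern.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # accumulate gradients over 4 micro-batches

def training_step(batch, step):
    """Run one micro-batch; returns False if it was skipped due to OOM."""
    try:
        loss = model(batch).sum() / accum_steps
        loss.backward()  # adds this micro-batch's gradients into p.grad
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        torch.cuda.empty_cache()
        # This is exactly the problem: if I now `del p.grad` (as in the
        # fairseq-style recovery), I also discard the gradients already
        # accumulated from earlier micro-batches in this accumulation window.
        return False
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
    return True
```

The uncertainty is whether, when the OOM is raised partway through `backward()`, `p.grad` has already been partially updated for this micro-batch, or whether it is safe to keep the accumulated gradients and just skip the failed batch.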