Handling OOM errors during DDP backward

T_Qri · August 10, 2020, 9:54am

Hi, I’m wondering how to deal with occasional OOM errors that happen during DDP backward.

For forward, OOM can be caught with a simple try/except statement. For backward, however, loss.backward() computes the gradients while the registered hooks perform gradient reduction at the same time.
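A minimal sketch of the forward-pass case, assuming the OOM surfaces as a RuntimeError whose message contains "out of memory" (the usual form of a CUDA OOM in PyTorch). The names is_oom, forward_or_skip and the cleanup parameter are hypothetical, not part of any library:

```python
from typing import Any, Callable, Optional


def is_oom(exc: BaseException) -> bool:
    """Heuristic: treat a RuntimeError mentioning 'out of memory' as an OOM."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def forward_or_skip(
    forward: Callable[[Any], Any],
    batch: Any,
    cleanup: Optional[Callable[[], None]] = None,
) -> Optional[Any]:
    """Run forward(batch); return None if it failed with an OOM error.

    Errors that do not look like OOM are re-raised unchanged.
    """
    try:
        return forward(batch)
    except RuntimeError as exc:
        if not is_oom(exc):
            raise
        if cleanup is not None:
            cleanup()  # assumption: caller passes e.g. torch.cuda.empty_cache
        return None
```

In a real loop, forward would be the model call and cleanup could be torch.cuda.empty_cache. This only covers the catch itself; it does not coordinate anything across DDP processes.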

Is it possible for training to hang when OOM errors occur during backward in several processes, so that the other, successful processes keep waiting for them? If so, is there a nice way to recover from this problem?

mrshenli · August 10, 2020, 2:51pm

Yes, it is. If one process hits OOM and skips or reruns the backward pass, it causes desynchronization across the processes in the same group, which leads to a hang or a crash.

If so, is there a nice way to recover from this problem?

Yep, TorchElastic is built to solve this issue. cc @Kiuk_Chung

Kiuk_Chung · August 10, 2020, 11:58pm

Have a look at:

https://pytorch.org/elastic/0.2.0/train_script.html - for instructions on how to write a “torchelastic compliant” train script
https://pytorch.org/elastic/0.2.0/quickstart.html - for a quickstart on launching your script with torchelastic
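Since recovery here means restarting the workers, the train script has to be able to pick up from its last checkpoint. A rough sketch of that resume pattern, not TorchElastic's own API: JSON files stand in for torch.save/torch.load, and load_state, save_state, train and the per-epoch checkpoint granularity are all assumptions made for illustration:

```python
import json
import os
import tempfile
from typing import Callable, Dict


def load_state(path: str) -> Dict[str, int]:
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}


def save_state(path: str, state: Dict[str, int]) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def train(path: str, num_epochs: int, train_one_epoch: Callable[[int], None]) -> None:
    """Resume from the last completed epoch and checkpoint after every epoch."""
    state = load_state(path)
    for epoch in range(state["epoch"], num_epochs):
        train_one_epoch(epoch)  # may raise, e.g. on OOM during backward
        state["epoch"] = epoch + 1
        save_state(path, state)
```

If train_one_epoch raises partway through, the process exits with the last completed epoch still on disk, and the restarted process continues from there instead of from epoch 0.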

T_Qri · August 11, 2020, 2:21am

Thank you, I will try it out.

T_Qri · August 11, 2020, 2:21am

Thank you, I will give it a try.