Can Gradient Checkpointing and Gradient Accumulation be used together?

ben9004 · September 13, 2021, 5:56pm

To train with a large batch size, can gradient checkpointing and gradient accumulation be used together?

I think they should not be used together, because gradient checkpointing doesn't keep the computational graph for some of its layers and also turns off their requires_grad flag, so the gradients from the accumulation steps won't be added at all.

Am I wrong? Please tell me the right answer!
Thanks

ptrblck · September 14, 2021, 3:33am

I think you are wrong, as I can't think of a limitation that would prevent this from working. Or are you seeing any issues while using both utilities?

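Your description of gradient checkpointing is not entirely true. Gradient checkpointing skips storing the intermediate activations of the checkpointed operations in the forward pass and recomputes them during the backward pass, trading compute for memory. The parameters still receive their gradients, which accumulate in .grad across steps as usual, so the result (compared to a non-checkpointed run) should be the same.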
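If you want to check this yourself, here is a minimal sketch that compares the accumulated gradients of a checkpointed and a non-checkpointed copy of the same model. The `Net` module, the helper `accumulated_grads`, the layer sizes, and the number of accumulation steps are all made up for illustration, and it assumes a recent PyTorch release in which `torch.utils.checkpoint.checkpoint` accepts `use_reentrant=False`.

```python
import copy

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)


class Net(nn.Module):
    """Hypothetical toy model; `block` is optionally run under checkpointing."""

    def __init__(self, use_checkpoint):
        super().__init__()
        self.use_checkpoint = use_checkpoint
        self.block = nn.Sequential(
            nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16), nn.ReLU()
        )
        self.head = nn.Linear(16, 1)

    def forward(self, x):
        if self.use_checkpoint:
            # Activations inside `block` are not stored here; they are
            # recomputed during the backward pass.
            h = checkpoint(self.block, x, use_reentrant=False)
        else:
            h = self.block(x)
        return self.head(h)


def accumulated_grads(model, data, target, accum_steps):
    """Accumulate gradients over `accum_steps` micro-batches (illustrative helper)."""
    model.zero_grad()
    for x, y in zip(data.chunk(accum_steps), target.chunk(accum_steps)):
        # Scale each micro-batch loss so the summed gradients match one large batch.
        loss = nn.functional.mse_loss(model(x), y) / accum_steps
        loss.backward()  # gradients are summed into .grad on every call
    return [p.grad.clone() for p in model.parameters()]


plain = Net(use_checkpoint=False)
ckpt = copy.deepcopy(plain)  # identical weights
ckpt.use_checkpoint = True

data, target = torch.randn(32, 8), torch.randn(32, 1)
g_plain = accumulated_grads(plain, data, target, accum_steps=4)
g_ckpt = accumulated_grads(ckpt, data, target, accum_steps=4)

# True if every parameter gradient agrees between the two runs.
grads_match = all(
    torch.allclose(a, b, atol=1e-6) for a, b in zip(g_plain, g_ckpt)
)
print(grads_match)
```

In a real training loop you would call `optimizer.step()` and `optimizer.zero_grad()` once after the accumulation steps; the sketch leaves the optimizer out and only compares the gradients.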