Can Gradient Checkpointing and Gradient Accumulation be used together?

To train with a large effective batch size, can gradient checkpointing and gradient accumulation be used together?

I think they shouldn't be combined, because gradient checkpointing discards part of its layers' computational graph and turns off their requires_grad flag, so the accumulation steps wouldn't add up at all.

Am I wrong? Please tell me the right answer!

I think you are wrong, as I can't think of a limitation that would prevent this from working. Are you seeing any issues while using both utilities together?

That's not entirely true. Gradient checkpointing recomputes specific operations during the backward pass to trade compute for memory. However, the resulting gradients (compared to a non-checkpointed run) should be the same, so accumulating them across micro-batches works as usual.
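To illustrate, here is a minimal sketch that combines both: each micro-batch forward goes through `torch.utils.checkpoint.checkpoint`, the scaled losses are backpropagated so gradients accumulate, and the result is compared against a plain full-batch backward on an identical copy. The toy model, sizes, and `accum_steps` are arbitrary choices for the demo, not anything from the thread:

```python
import copy
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
ref = copy.deepcopy(model)  # identical copy for the non-checkpointed reference

data = torch.randn(16, 8)
target = torch.randn(16, 1)
accum_steps = 4  # arbitrary number of micro-batches for the demo

# Gradient accumulation: backward() on each micro-batch sums into .grad.
# Each forward is checkpointed, so activations are recomputed in backward.
for x, y in zip(data.chunk(accum_steps), target.chunk(accum_steps)):
    out = checkpoint(model, x, use_reentrant=False)
    # Divide by accum_steps so the summed micro-batch losses
    # match the mean over the full batch (equal-sized chunks).
    (nn.functional.mse_loss(out, y) / accum_steps).backward()

# Plain full-batch backward on the reference copy, no checkpointing.
nn.functional.mse_loss(ref(data), target).backward()

# The accumulated, checkpointed gradients match the ordinary ones
# up to floating-point tolerance.
for p, q in zip(model.parameters(), ref.parameters()):
    assert torch.allclose(p.grad, q.grad, atol=1e-6)
```

After the loop you would call `optimizer.step()` and `optimizer.zero_grad()` exactly as in any accumulation setup; checkpointing changes nothing about that part.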