To train with a large effective batch size, can gradient checkpointing and gradient accumulation be used together?
I think they should not be used together, because gradient checkpointing discards part of each layer's computational graph (the intermediate activations) during the forward pass, and as I understand it also turns off the `requires_grad` flag on those tensors, so the gradients from each accumulation step would not be added up at all.
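For context, this is roughly what I imagine the combined setup would look like. This is just a minimal sketch with a toy model and random data (not my real training code), using `torch.utils.checkpoint.checkpoint` for the checkpointed blocks:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy model: each block's activations are recomputed during backward
# instead of being stored, via gradient checkpointing.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
        self.head = nn.Linear(128, 10)

    def forward(self, x):
        # use_reentrant=False lets gradients flow even when the
        # input itself does not require grad
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)

model = ToyModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
accum_steps = 4  # effective batch = micro-batch size * accum_steps

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(8, 128)               # hypothetical micro-batch
    y = torch.randint(0, 10, (8,))
    loss = criterion(model(x), y) / accum_steps  # scale loss for accumulation
    loss.backward()                        # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```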
Am I wrong? Please tell me the right answer!