During the forward pass, all intermediate activations are retained for use in the backward pass, so we can easily measure the GPU memory allocated by the forward pass like this:
```python
pre_fw = torch.cuda.memory_allocated() / 1024**2
forward(...)
post_fw = torch.cuda.memory_allocated() / 1024**2
fw_g = post_fw - pre_fw
```
However, the backward pass frees each activation once its gradient has been computed. In other words, comparing `memory_allocated()` before and after `backward()` does not capture the memory the backward pass actually used.
So how do we measure the GPU memory allocated during the backward pass?
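One idea I've considered is tracking the peak allocation instead of the before/after difference, using `torch.cuda.reset_peak_memory_stats()` and `torch.cuda.max_memory_allocated()`. This is only a sketch of the approach (the helper name `measure_backward_peak` is my own), and I'm not sure it's the recommended way:

```python
import torch

def measure_backward_peak(loss):
    """Sketch: measure GPU memory used during backward via peak statistics.

    Assumes CUDA is available and `loss` is a scalar tensor on the GPU.
    """
    torch.cuda.synchronize()
    # Reset the peak counter so the peak reflects only the backward pass.
    torch.cuda.reset_peak_memory_stats()
    pre_bw = torch.cuda.memory_allocated() / 1024**2
    loss.backward()
    torch.cuda.synchronize()
    peak_bw = torch.cuda.max_memory_allocated() / 1024**2
    post_bw = torch.cuda.memory_allocated() / 1024**2
    # peak_bw - pre_bw: extra memory the backward needed at its high-water mark.
    # post_bw - pre_bw: net change (often negative: activations freed, grads allocated).
    return peak_bw - pre_bw, post_bw - pre_bw
```

The peak-minus-start number should capture the transient memory the backward pass needed even though the activations are freed again by the time it returns, but I'd appreciate confirmation that this is correct.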