How do we measure the allocated GPU memory during the backward pass?

We know that the forward pass retains all intermediate activations (they are needed for gradient computation), so we can easily measure the GPU memory allocated by the forward pass like this:

import torch

pre_fw = torch.cuda.memory_allocated() / 1024**2    # MB allocated before the forward pass
forward(...)
post_fw = torch.cuda.memory_allocated() / 1024**2    # MB allocated after the forward pass
fw_g = post_fw - pre_fw                              # memory held by retained activations and outputs

However, the backward pass frees each activation as soon as the corresponding gradient has been computed. In other words, the code above cannot correctly measure the GPU memory allocated during the backward pass.
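For instance, applying the same pattern to backward (a rough sketch, assuming a scalar loss tensor produced by the forward pass) only reports the net change, which can even be negative because freed activations offset the newly allocated gradients:

pre_bw = torch.cuda.memory_allocated() / 1024**2
loss.backward()                                      # activations are freed as gradients are produced
post_bw = torch.cuda.memory_allocated() / 1024**2
bw_g = post_bw - pre_bw                              # net change only, not what backward actually used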

Thus, I wonder: how can we measure the GPU memory allocated during the backward pass?

Not sure what you are trying to measure. Backward doesn’t allocate additional memory unless you are preparing to do a double backward.
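If what you are after is the transient memory that backward uses while it runs, one thing you could try (a sketch, assuming a scalar loss already computed on a CUDA device) is PyTorch's peak-memory counters around the call:

torch.cuda.synchronize()
torch.cuda.reset_peak_memory_stats()                  # start peak tracking from the current allocation
pre_bw = torch.cuda.memory_allocated() / 1024**2
loss.backward()
torch.cuda.synchronize()
post_bw = torch.cuda.memory_allocated() / 1024**2
peak_bw = torch.cuda.max_memory_allocated() / 1024**2
print(f"net change during backward: {post_bw - pre_bw:.1f} MB")   # gradients minus freed activations
print(f"peak above pre-backward:    {peak_bw - pre_bw:.1f} MB")   # transient high-water mark

The net change mostly reflects the newly allocated gradient buffers, while the peak shows how much extra memory backward needed at its worst moment before activations were released.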