Auxiliary Loss with Gradient Checkpointing in LLMs

Update:

I found this post of yours (Checkpoint with no grad requiring inputs PROBLEM - #9 by ptrblck), which explains why the backward pass breaks when none of the checkpointed inputs require grad.

I updated the code accordingly: I now set requires_grad=True on x and pass use_reentrant=True to checkpoint.
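To illustrate what I mean, here is a reduced sketch of that setup (a toy block with placeholder names and a made-up auxiliary score, not the actual gist code): the checkpointed input requires grad, so the reentrant checkpoint has something to attach its backward to, and the block returns auxiliary scores alongside the hidden states.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    # Placeholder block, not the original model
    def __init__(self, dim: int = 16):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        out = self.linear(x)
        aux_scores = out.pow(2).mean(dim=-1)  # stand-in for the auxiliary scores
        return out, aux_scores


block = Block()
x = torch.randn(4, 16)
x.requires_grad_(True)  # reentrant checkpoint needs at least one grad-requiring input

out, scores = checkpoint(block, x, use_reentrant=True)
loss = out.sum() + 0.1 * scores.sum()  # main loss + weighted auxiliary loss
loss.backward()
```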

Updated Gist:

This works for the version where the scores are explicitly passed to the block's forward method, but not for the version where the scores are set as an attribute. There I get the following error:

Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    tensors, grad_tensors_, retain_graph, create_graph, inputs,
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
E       RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
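For reference, here is a minimal sketch of the variant that works for me, the one where the scores are passed explicitly (again with placeholder names and a made-up score formula, not the gist code): the running scores tensor goes in and out of every checkpointed forward, so everything the auxiliary loss needs stays inside the graph that checkpoint manages, and a single backward over the combined loss goes through.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    # Placeholder block: accumulates an auxiliary score into an explicit input tensor
    def __init__(self, dim: int = 16):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, scores):
        out = self.linear(x)
        scores = scores + out.pow(2).mean()  # add this block's auxiliary score
        return out, scores


blocks = nn.ModuleList(Block() for _ in range(3))
x = torch.randn(4, 16, requires_grad=True)
scores = torch.zeros((), requires_grad=True)  # explicit grad-requiring input, not a module attribute

h = x
for block in blocks:
    h, scores = checkpoint(block, h, scores, use_reentrant=True)

loss = h.sum() + 0.1 * scores  # main loss + weighted auxiliary loss, single backward
loss.backward()
```

In the failing version, the only difference is that each block stores its scores on self instead of passing them through forward, and the auxiliary loss is built from those attributes after the checkpointed calls.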