Training different stages of a model with different losses

You are most likely running into this issue: the backward pass fails to compute gradients because the forward activations are stale after a parameter update. The optimizer step modifies parameters in place, and the autograd graph for the second loss still needs their pre-update values.
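A minimal sketch of the failure and one fix, assuming a two-stage setup (the names `stage1`, `stage2`, and the squared-activation losses are placeholders for illustration): stepping the first optimizer between the two `backward()` calls mutates parameters that the second loss's graph still references, so autograd raises a `RuntimeError`. Computing all gradients before any optimizer step avoids it.

```python
import torch
import torch.nn as nn

# Hypothetical two-stage model: stage2 consumes stage1's output.
stage1 = nn.Linear(4, 4)
stage2 = nn.Linear(4, 1)
opt1 = torch.optim.SGD(stage1.parameters(), lr=0.1)
opt2 = torch.optim.SGD(stage2.parameters(), lr=0.1)

x = torch.randn(2, 4, requires_grad=True)
h = stage1(x)
out = stage2(h)
loss1 = h.pow(2).mean()   # loss on the intermediate stage
loss2 = out.pow(2).mean()  # loss on the final stage

# Broken ordering: opt1.step() updates stage1's parameters in place,
# so the saved forward activations for loss2's graph are stale.
loss1.backward(retain_graph=True)
opt1.step()
err = None
try:
    loss2.backward()
except RuntimeError as e:
    err = e  # "... modified by an inplace operation"
print("second backward failed:", err is not None)

# Fix: accumulate all gradients first, then step both optimizers.
opt1.zero_grad()
opt2.zero_grad()
h = stage1(x)
out = stage2(h)
h.pow(2).mean().backward(retain_graph=True)
out.pow(2).mean().backward()
opt1.step()
opt2.step()
print("reordered version ran fine")
```

Another option, if the stages really must be stepped separately, is to re-run the forward pass after each `step()` so the second loss is computed from fresh activations.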
