I am trying a distillation-like setup where the predictions from one model (say model X) are used as targets for another model (say model Y). Since model X is pre-trained, I put it in eval mode (`model.eval()`), compute its logits, and train model Y against them.

In this case, would PyTorch keep the computation graph (and intermediate activations) for model X around, or free it as soon as model X's forward pass completes? Is there a way to explicitly free the memory held by model X's computation graph?
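For concreteness, here is a minimal sketch of what I mean, with tiny linear layers standing in for models X and Y (the actual models are larger). This also shows the `torch.no_grad()` approach I am considering, since my understanding is that `eval()` alone does not stop autograd from recording the forward pass:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical tiny stand-ins for model X (teacher) and model Y (student).
teacher = nn.Linear(8, 4)
student = nn.Linear(8, 4)

# eval() changes layer behavior (dropout, batch norm), but does NOT
# by itself prevent autograd from building a graph for the teacher.
teacher.eval()

x = torch.randn(16, 8)

# torch.no_grad() stops autograd from recording the teacher's forward
# pass, so no graph or intermediate activations are kept for model X.
with torch.no_grad():
    teacher_logits = teacher(x)

# The teacher's output carries no graph.
assert not teacher_logits.requires_grad

# Train the student against the teacher's soft targets (KL divergence).
student_logits = student(x)
loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)
loss.backward()  # only the student's graph exists and is freed here
```

An alternative I have seen is calling `teacher(x).detach()` without `no_grad()`, but my impression is that this still builds the teacher's graph during the forward pass before detaching the output, so `no_grad()` seems preferable memory-wise.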