I’m trying to train intermediate layers independently, but when I run my code I get an out-of-memory error.
I would like to know whether retain_graph=True is the cause, and what I need to do to solve this problem.
I am using the Hugging Face BERT model, and this is my code:
import torch
import torch.nn.functional as F

t_outputs = t_model(**inputs)   # teacher forward pass
s_outputs = s_model(**inputs)   # student forward pass
encoder_layers = args.encoder_layers
loss = torch.nn.KLDivLoss(reduction='batchmean')
batch, row, col = s_outputs.size()

for i, k in enumerate(encoder_layers):
    output = loss(
        F.log_softmax(s_outputs[i + 1], dim=2).view(batch * row, 1, col),
        F.softmax(t_outputs[k + 1], dim=2).view(batch * row, 1, col),
    )
    # to freeze the earlier student layers
    for name, p in s_model.named_parameters():
        if "layer." + str(i - 1) in name:
            p.requires_grad = False
    output.backward(retain_graph=True)
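For reference, this is the alternative I’m considering (just a minimal sketch that reuses s_outputs, t_outputs, and encoder_layers from the snippet above): accumulate the per-layer KL losses into one scalar and call backward() only once, so that retain_graph=True is no longer needed. Would this avoid the memory growth, or am I missing something?

import torch
import torch.nn.functional as F

kl_loss = torch.nn.KLDivLoss(reduction='batchmean')

total_loss = 0.0
for i, k in enumerate(encoder_layers):
    s_hidden = s_outputs[i + 1]            # student hidden state, shape [batch, row, col]
    t_hidden = t_outputs[k + 1].detach()   # teacher is not trained, so detach its graph
    batch, row, col = s_hidden.size()
    total_loss = total_loss + kl_loss(
        F.log_softmax(s_hidden, dim=2).view(batch * row, 1, col),
        F.softmax(t_hidden, dim=2).view(batch * row, 1, col),
    )

total_loss.backward()   # single backward pass, no retain_graph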