I’ve tried to train intermediate layers independently, but when I run my code I get an out-of-memory error.
I would like to know whether retain_graph=True is the problem, and what I have to do to solve it.
I am using the Hugging Face BERT model, and this is my code.
Thank you.
import torch
import torch.nn.functional as F

t_outputs = t_model(**inputs)  # teacher forward pass
s_outputs = s_model(**inputs)  # student forward pass
encoder_layers = args.encoder_layers

loss = torch.nn.KLDivLoss(reduction='batchmean')
batch, row, col = s_outputs[3][1].size()

for i, k in enumerate(encoder_layers):
    # KL divergence between student hidden state i+1 and teacher hidden state k+1
    output = loss(F.log_softmax(s_outputs[3][i+1], dim=2).view(batch*row, 1, col),
                  F.softmax(t_outputs[3][k+1], dim=2).view(batch*row, 1, col))
    # freeze the layer trained in the previous iteration
    for name, p in s_model.named_parameters():
        if "layer." + str(i-1) in name:
            p.requires_grad = False
    output.backward(retain_graph=True)
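Would something like the following avoid the issue? This is only a sketch of what I mean, not tested (same t_model, s_model, inputs, and args as above): run the teacher under torch.no_grad() so its graph is never built, accumulate the per-layer losses into one scalar, and call backward() once, so retain_graph is not needed at all.

import torch
import torch.nn.functional as F

kl = torch.nn.KLDivLoss(reduction='batchmean')

# The teacher is frozen, so skip building its autograd graph entirely.
with torch.no_grad():
    t_outputs = t_model(**inputs)

s_outputs = s_model(**inputs)
batch, row, col = s_outputs[3][1].size()

# Accumulate the per-layer losses into a single scalar ...
total = 0.0
for i, k in enumerate(args.encoder_layers):
    total = total + kl(
        F.log_softmax(s_outputs[3][i+1], dim=2).view(batch*row, 1, col),
        F.softmax(t_outputs[3][k+1], dim=2).view(batch*row, 1, col),
    )

# ... so one backward() call frees the whole graph; no retain_graph needed.
total.backward()

My understanding is that retain_graph=True keeps all the intermediate buffers alive across every backward() call, which would explain the memory growth, but I'm not sure this accumulated version is still equivalent to training each intermediate layer independently.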