GPU memory gets exhausted during the training loop in the code below. Each iteration creates its own tensors and the computation graph is released by loss.backward(), so I don't see anything that should accumulate. (My understanding is that gradient information is updated on every iteration, but since there is exactly one gradient buffer per parameter, the gradients themselves should not grow in memory.)
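As a sanity check of that assumption, here is a toy sketch (a small linear layer, not the actual model) showing that repeated backward() calls reuse the same fixed-size .grad buffers:

import torch

# Toy check: repeated backward() calls write into the same fixed-size .grad
# buffers, so gradient storage itself should not grow across iterations.
layer = torch.nn.Linear(16, 16)
for step in range(3):
    out = layer(torch.randn(1, 16)).sum()
    out.backward()                      # graph is freed after this call
    grad_numel = sum(p.grad.numel() for p in layer.parameters())
    print(step, grad_numel)             # same number every iteration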
Also, I am using HF models such as meta-llama/Llama-2-7b-hf. With the 7B model, memory only grows by about 30 MB from the start to the end of the loop, whereas with the 70B model it grows enough to look like a memory leak (roughly 1000 MB per iteration?).
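The per-iteration growth can be checked with something like the sketch below (the wrapper and where it is called from are just for illustration; torch.cuda.memory_allocated and torch.cuda.max_memory_allocated are the standard measurement calls):

import torch

def report_memory(tag):
    # Currently allocated vs. peak allocated on the default CUDA device, in MB
    allocated = torch.cuda.memory_allocated() / 1024**2
    peak = torch.cuda.max_memory_allocated() / 1024**2
    print(f"{tag}: allocated={allocated:.1f} MB, peak={peak:.1f} MB")

# Illustration only: wrap each call to forward() with a measurement, e.g.
#   report_memory("before step")
#   loss_value = forward(model, state, y, normailized_similarity_scores, args)
#   report_memory("after step")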
Does anyone know the cause of this?
import torch

def forward(model, state, y, normailized_similarity_scores, args, eval=False):
    loss_value = 0.0
    for n in range(args["batch_size"]):
        token_length = len(y[n])
        for t in range(token_length):
            # Skip padding positions
            if y[n][t] == 0:
                continue
            # Forward pass on the prefix for sample n at step t
            outputs = model(input_ids=state[n][t].detach())
            logits = outputs.logits
            last_token_logits = logits[0, -1, :]
            log_probs = torch.nn.functional.log_softmax(last_token_logits, dim=-1)
            target_token_log_prob = log_probs[y[n][t].detach().item()]
            # REINFORCE-style weighting by the normalized reward over the batch baseline
            baseline = torch.mean(normailized_similarity_scores)
            reward = normailized_similarity_scores[n].item()
            R = reward / baseline
            loss = target_token_log_prob * R / args["batch_size"] / token_length
            if not eval:
                # Backward per token; the graph should be freed after each call
                loss.backward()
            loss_value += loss.detach().item()
            # Explicitly drop references and release cached blocks
            del outputs, logits, last_token_logits, log_probs, target_token_log_prob, baseline, reward, R, loss
            torch.cuda.empty_cache()
    return loss_value
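In case it helps to reproduce, forward() is driven from an outer loop roughly like the sketch below (the optimizer, learning rate, and get_batches() are placeholders, not the exact training code):

import torch
from transformers import AutoModelForCausalLM

# Rough sketch of the outer loop around forward(); everything here except
# the call to forward() itself is an assumption for illustration.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
args = {"batch_size": 4}

def get_batches():
    # Placeholder: yields (state, y, normailized_similarity_scores) tuples
    # shaped the way forward() expects them.
    raise NotImplementedError

for state, y, normailized_similarity_scores in get_batches():
    optimizer.zero_grad(set_to_none=True)   # drop old .grad buffers
    loss_value = forward(model, state, y,
                         normailized_similarity_scores, args)
    optimizer.step()                        # apply the accumulated gradients
    torch.cuda.empty_cache()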