I want to eliminate the steady growth of memory usage during my training loop

Memory gets exhausted during the training loop of the code below. Each iteration creates only local variables and releases its computation graph via loss.backward(), so I don't see anything that should accumulate. (My understanding is that the gradient information is updated at every iteration, but since the gradient buffers map one-to-one onto the parameters, they should not grow over time.)
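As a sanity check of that assumption, here is a toy sketch (the small Linear layer and names are only illustrative, not from my actual setup) showing that repeated backward() calls write into the same fixed-size .grad buffers rather than allocating new ones:

import torch

# Toy illustration only: repeated backward() calls accumulate into the same
# .grad buffers, so their total size stays constant across iterations.
layer = torch.nn.Linear(4, 4)
for step in range(3):
    out = layer(torch.randn(2, 4)).sum()
    out.backward()  # the graph built for this iteration is freed here
    total_grad_elems = sum(p.grad.numel() for p in layer.parameters())
    print(step, total_grad_elems)  # same element count every step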

The model is a Hugging Face model such as meta-llama/Llama-2-7b-hf. With the 7B model, memory usage grows by only about 30 MB between the start and the end of the loop, whereas with the 70B model it grows enough to look like a leak (roughly 1000 MB per iteration?).
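For reference, a minimal sketch of how the per-iteration growth can be measured, assuming a single CUDA device (the helper name and labels are just illustrative):

import torch

def log_cuda_memory(tag):
    # Report allocated vs. reserved memory so allocator caching can be
    # distinguished from genuine growth.
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"{tag}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")

# e.g. call log_cuda_memory(f"iter {t}") at the start of every iteration
# and compare the deltas across iterations.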

Does anyone know the cause of this?


import torch

def forward(model, state, y, normailized_similarity_scores, args, eval=False):
    loss_value = 0.0
    for n in range(args["batch_size"]):
        token_length = len(y[n])
        for t in range(token_length):
            # Skip padding positions.
            if y[n][t] == 0:
                continue

            # Forward pass for this sample/timestep.
            outputs = model(input_ids=state[n][t].detach())
            logits = outputs.logits

            # Log-probability of the target token at the last position.
            last_token_logits = logits[0, -1, :]
            log_probs = torch.nn.functional.log_softmax(last_token_logits, dim=-1)
            target_token_log_prob = log_probs[y[n][t].detach().item()]

            # Reward relative to the batch-mean baseline.
            baseline = torch.mean(normailized_similarity_scores)
            reward = normailized_similarity_scores[n].item()
            R = reward / baseline

            loss = target_token_log_prob * R / args["batch_size"] / token_length

            if not eval:
                loss.backward()  # per-token backward; the graph should be freed here
            loss_value += loss.detach().item()

            # Drop every tensor created in this iteration and return cached
            # blocks to the allocator.
            del outputs, logits, last_token_logits, log_probs, target_token_log_prob, baseline, reward, R, loss
            torch.cuda.empty_cache()

    return loss_value