Knowledge Distillation

Hello everyone, I'm using knowledge distillation to train a student model. The teacher model was trained beforehand and is meant to guide the student. However, when the student trains, the total loss is negative. Is this normal behaviour during training? Any suggestions are welcome. Thank you very much.
Here is the relevant section of the code.

import torch
import torch.nn.functional as F

temp = 5
alpha = 0.9
for phase in ['train', 'val']:
    ......
    losses = []
    for data, targets in pbar:
        data = data.to(device)
        targets = targets.to(device)
    
        with torch.no_grad():
            teacher_preds = teacher(data)
    
        student_preds = student(data)
        _, preds = torch.max(student_preds, 1)
        student_loss = student_loss_fn(student_preds, targets)

        # distillation term between the temperature-softened student and teacher outputs
        distillation_loss = divergence_loss_fn(
            F.softmax(student_preds / temp, dim=1),
            F.softmax(teacher_preds / temp, dim=1)
        )

        loss = alpha * student_loss + (1 - alpha) * distillation_loss
        losses.append(loss.item())
        running_corrects += torch.sum(preds == targets.data)
        curr_train_samples += len(targets)
        # backward
        optimizer.zero_grad()
        if phase == 'train':
            loss.backward()
            optimizer.step()

    epoch_loss = sum(losses) / len(losses)
    epoch_acc = running_corrects.double() / curr_train_samples
    ......
    print(f'\t{phase}: Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')

Sample output

Epoch 1

train: Loss: -0.2539 Acc: 0.3342
val: Loss: -0.2725 Acc: 0.4614

Epoch 2

train: Loss: -0.4222 Acc: 0.5226
val: Loss: -0.4394 Acc: 0.5932

Update: I found a suggestion that helped, and it solved the issue I faced.
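
For anyone who runs into the same thing: if divergence_loss_fn is nn.KLDivLoss, it expects log-probabilities as its first argument, so passing F.softmax(student_preds / temp, dim=1) can make the reported divergence term, and therefore the total loss, go negative. Below is a minimal sketch of the formulation that is commonly used, assuming divergence_loss_fn = nn.KLDivLoss(reduction='batchmean') and student_loss_fn = nn.CrossEntropyLoss(); treat it as a reference, not a drop-in fix for every setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

temp = 5
alpha = 0.9

student_loss_fn = nn.CrossEntropyLoss()
# KLDivLoss expects log-probabilities for the input and probabilities for the target
divergence_loss_fn = nn.KLDivLoss(reduction='batchmean')

def combined_loss(student_preds, teacher_preds, targets):
    # hard-label loss on the raw student logits
    student_loss = student_loss_fn(student_preds, targets)

    # soft-label loss: log_softmax for the student, softmax for the teacher
    distillation_loss = divergence_loss_fn(
        F.log_softmax(student_preds / temp, dim=1),
        F.softmax(teacher_preds / temp, dim=1)
    ) * (temp ** 2)  # common scaling so the soft-target gradients keep a comparable magnitude

    # both terms are non-negative, so the weighted sum stays non-negative
    return alpha * student_loss + (1 - alpha) * distillation_loss

With this formulation the logged loss should stay non-negative and decrease as the student improves.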