Hello everyone, I’m using knowledge distillation to train a model. The teacher model was trained beforehand and is used to guide the student. However, while the student trains, the total loss comes out negative. Is this normal behaviour during training? Any suggestions are welcome. Thank you very much.
Here is the relevant section of the code.
temp = 5
alpha = 0.9
for phase in ['train', 'val']:
    ......
    losses = []
    for data, targets in pbar:
        data = data.to(device)
        targets = targets.to(device)

        # teacher is frozen; no gradients needed for its forward pass
        with torch.no_grad():
            teacher_preds = teacher(data)

        student_preds = student(data)
        _, preds = torch.max(student_preds, 1)
        student_loss = student_loss_fn(student_preds, targets)
        distillation_loss = divergence_loss_fn(
            F.softmax(student_preds / temp, dim=1),
            F.softmax(teacher_preds / temp, dim=1)
        )
        loss = alpha * student_loss + (1 - alpha) * distillation_loss
        losses.append(loss.item())
        running_corrects += torch.sum(preds == targets.data)
        curr_train_samples += len(targets)

        # backward
        optimizer.zero_grad()
        if phase == 'train':
            loss.backward()
            optimizer.step()

    epoch_loss = sum(losses) / len(losses)
    epoch_acc = running_corrects.double() / curr_train_samples
    ....
    print(f'\t{phase}: Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')
Sample output
Epoch 1
train: Loss: -0.2539 Acc: 0.3342
val: Loss: -0.2725 Acc: 0.4614
Epoch 2
train: Loss: -0.4222 Acc: 0.5226
val: Loss: -0.4394 Acc: 0.5932
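In case it helps narrow things down: if divergence_loss_fn is something like PyTorch's nn.KLDivLoss, it computes target * (log(target) - input) and therefore expects its first argument to already be log-probabilities (F.log_softmax), not probabilities. Below is a minimal pure-Python sketch of that formula with made-up logits, showing how feeding probabilities into the "input" slot pushes the value negative, while feeding log-probabilities gives a true, non-negative KL divergence:

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits standing in for one sample's student/teacher outputs.
temp = 5
p = softmax([x / temp for x in [2.0, 0.5, -1.0]])   # student probabilities
q = softmax([x / temp for x in [1.5, 1.0, -0.5]])   # teacher probabilities

# KLDivLoss-style formula: sum(target * (log(target) - input)),
# which assumes `input` already holds log-probabilities.

# Feeding probabilities as `input` (as in the snippet above):
wrong = sum(qi * (math.log(qi) - pi) for qi, pi in zip(q, p))

# Feeding log-probabilities as `input`:
right = sum(qi * (math.log(qi) - math.log(pi)) for qi, pi in zip(q, p))

print(wrong)   # negative: log(q) < 0 while p > 0, so every term is negative
print(right)   # >= 0, as a KL divergence must be
```

Since every probability is below 1, log(q) is negative and p is positive, so each term of the "wrong" sum is negative regardless of the logits, which would explain a loss that is negative from the first epoch.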