Hi,
I am doing model distillation with mixed-precision training, but my code gives strange results: training with the distillation term behaves very differently from training without distillation, even when the distillation weight is set to zero. I wonder whether I am misusing mixed-precision training. The code is as follows:
    with torch.cuda.amp.autocast(enabled=True):
        outputs_student = model_student(inputs, targets)
        # teacher forward pass without tracking gradients
        with torch.no_grad():
            outputs_teacher = model_teacher(inputs, targets)
        loss_distillation = distill_loss(outputs_student, outputs_teacher)
        loss_student = some_loss(outputs_student)
        loss = loss_student + weight * loss_distillation

    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # was scaler.step(self.optimizer); same optimizer object as above
    scaler.update()
where scaler = torch.cuda.amp.GradScaler(enabled=True). As mentioned above, I tried setting weight to 0, but the result was still very different from training with loss_student only. Is there anything wrong with my use of mixed-precision training?
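For reference, here is a stripped-down, self-contained sketch of the training step above; the linear layers, MSE losses, and random tensors are only placeholders standing in for my real student/teacher models, distill_loss, and some_loss:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # placeholder models, optimizer, and data, just to make the snippet runnable
    model_student = nn.Linear(16, 16).cuda()
    model_teacher = nn.Linear(16, 16).cuda().eval()
    optimizer = torch.optim.SGD(model_student.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler(enabled=True)
    weight = 0.0  # distillation weight

    inputs = torch.randn(8, 16, device="cuda")
    targets = torch.randn(8, 16, device="cuda")

    with torch.cuda.amp.autocast(enabled=True):
        outputs_student = model_student(inputs)
        # teacher forward pass without tracking gradients
        with torch.no_grad():
            outputs_teacher = model_teacher(inputs)
        # placeholder losses standing in for distill_loss and some_loss
        loss_distillation = F.mse_loss(outputs_student, outputs_teacher)
        loss_student = F.mse_loss(outputs_student, targets)
        loss = loss_student + weight * loss_distillation

    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()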