Knowledge distillation with a sigmoid activation function

Hello everyone,
I have a question.

How can I use knowledge distillation with a sigmoid activation function?

I want to use knowledge distillation for multi-label classification (e.g. targets like 1, 0, 1, 1, 0), so I use a sigmoid activation with BCE loss. However, almost every knowledge distillation example I can find is built around softmax and KL divergence.
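For reference, the softmax-based distillation loss I keep seeing looks roughly like this (just a sketch of the usual formulation, with T as the temperature and alpha as the weighting between soft and hard targets):

import torch.nn as nn
import torch.nn.functional as F

def softmax_distillation_loss(outputs, teacher_outputs, labels, T=2.0, alpha=0.5):
    # soft targets: KL divergence between temperature-scaled student and teacher distributions
    soft_loss = nn.KLDivLoss(reduction='batchmean')(
        F.log_softmax(outputs / T, dim=1),
        F.softmax(teacher_outputs / T, dim=1),
    ) * (T * T)
    # hard targets: ordinary cross-entropy against the ground-truth class indices
    hard_loss = F.cross_entropy(outputs, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

This only works for single-label classification, which is why I am stuck for the multi-label case.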

I think it would be possible if I convert the loss to the line below, where the teacher's sigmoid outputs are used as soft targets, but I don't know whether it is correct:

alpha * nn.BCEWithLogitsLoss()(outputs / T, torch.sigmoid(teacher_outputs / T)) + (1 - alpha) * nn.BCEWithLogitsLoss()(outputs, labels)
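To give the full picture, here is the complete loss function I am experimenting with (the name kd_bce_loss and the default values for T and alpha are just my own choices; the student side is passed as raw temperature-scaled logits because, as far as I understand, BCEWithLogitsLoss already applies the sigmoid internally):

import torch
import torch.nn as nn

def kd_bce_loss(outputs, teacher_outputs, labels, T=2.0, alpha=0.5):
    # teacher_outputs are assumed to be logits computed under torch.no_grad()
    # soft targets: per-label teacher probabilities from temperature-scaled logits
    soft_targets = torch.sigmoid(teacher_outputs / T)
    soft_loss = nn.BCEWithLogitsLoss()(outputs / T, soft_targets)
    # hard targets: ordinary multi-label BCE against the 0/1 ground-truth vector
    hard_loss = nn.BCEWithLogitsLoss()(outputs, labels.float())
    return alpha * soft_loss + (1 - alpha) * hard_loss

Is this a reasonable way to transfer the teacher's per-label probabilities, and should the soft-target term also be scaled by T * T as in the softmax version?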

If you have any information or references about knowledge distillation based on a sigmoid activation and BCE loss, please let me know.