How to apply knowledge distillation in a multi-label setup?

Knowledge distillation, as usually described, operates on a temperature-scaled softmax over the logits. How can we apply it in a multi-label setting, where a sigmoid is generally applied to each logit independently to obtain a probability per label?
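
For reference, this is a minimal sketch of the standard single-label distillation loss I have in mind, following Hinton et al. (2015), written in PyTorch; `T` is the temperature hyperparameter:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Soften both distributions with temperature T, then take the KL
    # divergence between teacher and student; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures (Hinton et al., 2015).
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * T * T
```

The softmax here assumes the classes are mutually exclusive, which is exactly what breaks down in the multi-label case.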