Tips for implementing knowledge distillation

I’m trying to implement vanilla knowledge distillation (comparing the outputs of a teacher and a student model via a cross-entropy-style loss) and I’d like some implementation tips. In particular, I want to know whether there’s a way to save memory: we need to load two models onto the GPU and train the student on the outputs of both, which adds memory overhead compared to training the student alone without distillation. Is there a way to reduce this, for example by using multiple GPUs? I think it should be possible to put the two models on different GPUs and train the student with a single batch copied to both devices (roughly as in the sketch below), but most publicly available implementations seem to load both models on the same GPU.
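
Here is a minimal sketch of what I have in mind, assuming PyTorch. The model choices (resnet50 teacher, resnet18 student), the temperature, and the equal loss weighting are just placeholders, and only the teacher logits cross devices, not activations:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Assumed setup: teacher on cuda:0, student on cuda:1 (models are placeholders).
teacher = models.resnet50(weights=None).to("cuda:0").eval()
student = models.resnet18(weights=None).to("cuda:1")

optimizer = torch.optim.SGD(student.parameters(), lr=0.01)
temperature = 4.0  # softening temperature; value is an assumption

def distillation_step(inputs, labels):
    # Copy the same batch to both devices.
    inputs_t = inputs.to("cuda:0", non_blocking=True)
    inputs_s = inputs.to("cuda:1", non_blocking=True)
    labels_s = labels.to("cuda:1", non_blocking=True)

    # Teacher forward pass only: no_grad avoids storing activations,
    # which is where most of the extra memory would otherwise go.
    with torch.no_grad():
        teacher_logits = teacher(inputs_t)

    student_logits = student(inputs_s)

    # Teacher logits are small, so moving them to the student's GPU is cheap.
    teacher_logits = teacher_logits.to("cuda:1")

    # Soft-target loss (KL between softened distributions) plus the usual
    # hard-label cross-entropy on the student.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard_loss = F.cross_entropy(student_logits, labels_s)
    loss = soft_loss + hard_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Does this kind of two-GPU split actually help in practice, or is there a better-established way to keep the memory cost down?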
