I am trying to distill the ModifiedResNet model from CLIP.
I use (2, 2, 2, 2) for the convolutional blocks, 64 heads, a width of 32, and an output dimension of 1024. That makes the model around 30 MB.
I put the teacher into eval mode, create the image embedding with the teacher, and run it through a softmax. I do the same with the student, except that the student stays in train mode and I apply a log-softmax instead. I then feed both results into KLDivLoss (KL divergence) as the loss function and start training.
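For reference, here is a minimal dependency-free sketch of what that loss computes, including a temperature parameter `T` (my own addition, following the standard Hinton-style distillation recipe, not something from the CLIP codebase): soften both distributions with `T`, take KL(teacher ‖ student), and rescale by `T²`. In PyTorch this corresponds to `nn.KLDivLoss(reduction="batchmean")` applied to `log_softmax(student_logits / T)` and `softmax(teacher_logits / T)`.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(teacher_logits, student_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across T
    # (the usual knowledge-distillation convention).
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    return kl * T * T
```

Identical teacher and student logits give a loss of zero; any mismatch gives a positive loss, so minimizing it pulls the student's distribution toward the teacher's.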
Do I need to consider anything else here? The goal is a smaller CLIP model that reproduces the original results as closely as possible.
Any ideas are welcome.