I’m interested in using OpenAI’s CLIP as a pre-trained module inside a network that also contains other trainable modules. Input will first be passed through CLIP (e.g., its image encoder) for feature extraction, and the extracted features will then be fed to a trainable module. The whole network needs to be trained with back-propagation under some criterion.
My question is whether it suffices to put CLIP in eval mode (`.eval()`), or whether I also need to set its parameters’ `requires_grad` to `False`. More importantly, do I necessarily NEED CLIP’s parameters to have `requires_grad=True` in order to back-propagate and update the trainable module’s weights? I’m asking because if I don’t set `requires_grad=False`, I get an error during the `loss.backward()` call suggesting that I use `retain_graph=True`. My feeling is that I shouldn’t need `retain_graph=True` in the backward call, but I might obviously be wrong.
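For reference, here is a minimal sketch of the setup I mean. I’ve used a small `nn.Sequential` as a stand-in for CLIP’s image encoder (in the real code it would be something like `clip_model.encode_image`), just to show the freezing and the backward pass:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the pre-trained CLIP image encoder (hypothetical placeholder).
frozen_encoder = nn.Sequential(nn.Linear(32, 16), nn.ReLU())
frozen_encoder.eval()                       # eval mode: affects dropout/batch-norm only
for p in frozen_encoder.parameters():
    p.requires_grad_(False)                 # exclude encoder weights from gradients

# The module I actually want to train on top of the extracted features.
trainable_head = nn.Linear(16, 4)
optimizer = torch.optim.SGD(trainable_head.parameters(), lr=0.1)

x = torch.randn(8, 32)                      # dummy batch in place of images
with torch.no_grad():                       # feature extraction, no graph built here
    features = frozen_encoder(x)

logits = trainable_head(features)
loss = logits.pow(2).mean()                 # dummy criterion
loss.backward()                             # works without retain_graph=True
optimizer.step()
```

With both `requires_grad_(False)` on the encoder and `torch.no_grad()` around the feature extraction, only the head participates in the autograd graph, and `loss.backward()` runs without `retain_graph=True`.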