I’m interested in using OpenAI’s CLIP as a pre-trained module inside a network that also contains other trainable modules. Input will first be passed through CLIP (e.g., its image encoder) for feature extraction, and the extracted features will then be fed to a trainable module. The whole network needs to be trained with back-propagation under some criterion.
My question is whether it suffices to put CLIP in eval mode (`.eval()`), or whether I also need to set its parameters’ `requires_grad` to `False`. More importantly, do I necessarily NEED CLIP’s parameters to have `requires_grad=True` in order to back-propagate and update the trainable module’s weights? I’m asking because if I don’t set `requires_grad=False`, I get an error during the `loss.backward()` call suggesting that I use `retain_graph=True`. My feeling is that I shouldn’t need `retain_graph=True` in the backward call, but I might obviously be wrong.
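For reference, here is a minimal sketch of the setup I mean. I’ve used a small `nn.Sequential` as a stand-in for CLIP’s image encoder (in the real code it would be something like `clip_model.encode_image`), just to show the freezing and the backward pass:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the pre-trained CLIP image encoder (hypothetical placeholder).
frozen_encoder = nn.Sequential(nn.Linear(32, 16), nn.ReLU())
frozen_encoder.eval()                       # eval mode: affects dropout/batch-norm only
for p in frozen_encoder.parameters():
    p.requires_grad_(False)                 # exclude encoder weights from gradients

# The module I actually want to train on top of the extracted features.
trainable_head = nn.Linear(16, 4)
optimizer = torch.optim.SGD(trainable_head.parameters(), lr=0.1)

x = torch.randn(8, 32)                      # dummy batch in place of images
with torch.no_grad():                       # feature extraction, no graph built here
    features = frozen_encoder(x)

logits = trainable_head(features)
loss = logits.pow(2).mean()                 # dummy criterion
loss.backward()                             # works without retain_graph=True
optimizer.step()
```

With both `requires_grad_(False)` on the encoder and `torch.no_grad()` around the feature extraction, only the head participates in the autograd graph, and `loss.backward()` runs without `retain_graph=True`.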