Hi,
I have an image captioning model, and I want to encourage it to generate factually correct sentences by putting a pretrained BERT classifier on top of it (the model generates a sentence --> a pretrained classifier extracts “facts” from the sentence --> the loss is computed on the classification output).
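I want to use the Hugging Face Transformers library for the classifier, but its models expect a text sentence that is then tokenized. Because decoding to text and tokenizing are discrete steps, I am unsure how gradients from the classification loss can flow back into the captioning model.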
Does anyone have any ideas on how I can compute these gradients?
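
To make the setup concrete, here is a minimal sketch of one direction that might work: skip the text and tokenizer step entirely and feed the classifier a "soft" sentence through `inputs_embeds`, i.e. the captioner's per-position token probabilities multiplied by BERT's word-embedding matrix. This rests on assumptions that are not part of my actual setup: the captioning model is assumed to output logits over the same vocabulary as the BERT classifier, `caption_logits` and `fact_labels` are hypothetical stand-ins, and a tiny randomly initialised BERT is used instead of a pretrained one so the snippet runs without downloading weights.

```python
import torch
import torch.nn.functional as F
from transformers import BertConfig, BertForSequenceClassification

torch.manual_seed(0)

# Tiny random BERT classifier (assumption: stands in for the pretrained one,
# which would normally be loaded with from_pretrained).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=3,
)
classifier = BertForSequenceClassification(config)
classifier.eval()  # turn off dropout
for param in classifier.parameters():
    param.requires_grad_(False)  # keep the classifier frozen

# Hypothetical captioner output: one logit vector over the vocabulary per
# position. Assumption: the captioner shares BERT's vocabulary.
batch_size, seq_len = 2, 8
caption_logits = torch.randn(
    batch_size, seq_len, config.vocab_size, requires_grad=True
)

# Soft token distribution instead of a discrete argmax / decoded string.
token_probs = F.softmax(caption_logits, dim=-1)

# Expected embedding under that distribution: (batch, seq, hidden).
embedding_matrix = classifier.get_input_embeddings().weight
inputs_embeds = token_probs @ embedding_matrix

# Hypothetical "fact" labels, one class index per sentence.
fact_labels = torch.tensor([0, 2])

# Passing inputs_embeds bypasses input_ids, so nothing discrete is in the path.
outputs = classifier(inputs_embeds=inputs_embeds, labels=fact_labels)
outputs.loss.backward()

# Gradients reach the captioner's logits; the frozen classifier gets none.
print(caption_logits.grad.shape)
```

In a real setup, `caption_logits` would come from the captioning model's decoder, so `backward()` would update its weights. If the soft mixture of embeddings turns out to be too far from what the classifier saw during training, `F.gumbel_softmax(caption_logits, tau=1.0, hard=True)` could replace the softmax as a straight-through alternative, though whether either variant trains well here is untested.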