Compute gradients when using BERT classifier on top of image captioning model

Hi,

I have an image captioning model, and I want to encourage it to generate factually correct sentences by putting a pretrained BERT classifier on top of it. The pipeline would be: the model generates a sentence, a pretrained classifier extracts “facts” from that sentence, and the loss is calculated on the classification output.

I want to use the Huggingface Transformers library, but its models take a sentence as input and then tokenize it into discrete token IDs, so I am unsure how gradients from the classification loss can flow back into the captioning model.
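
To make the issue concrete, here is a minimal sketch of where I think the gradient gets cut. It uses toy stand-ins only: `captioner_head`, `clf_embed`, and `clf_head` are hypothetical placeholders, not my real captioning model or the actual BERT classifier, and it assumes both models share one vocabulary, which may not hold in practice. The "hard" path picks token IDs (as decoding and re-tokenizing a sentence would), while the "soft" path keeps the token distribution and mixes the classifier's embeddings with it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, hidden, seq_len, num_facts = 20, 8, 5, 3

# Hypothetical stand-ins for the real models (assumption: shared vocabulary).
captioner_head = nn.Linear(hidden, vocab_size)  # toy captioner output layer
clf_embed = nn.Embedding(vocab_size, hidden)    # toy classifier word embeddings
clf_head = nn.Linear(hidden, num_facts)         # toy classifier output layer

features = torch.randn(1, seq_len, hidden)      # pretend decoder features
target = torch.tensor([1])                      # pretend "fact" label

logits = captioner_head(features)               # (1, seq_len, vocab_size)

# Hard path: choose integer token IDs, then look up classifier embeddings.
token_ids = logits.argmax(dim=-1)               # integer tensor, not differentiable
hard_emb = clf_embed(token_ids)                 # (1, seq_len, hidden)
hard_loss = F.cross_entropy(clf_head(hard_emb.mean(dim=1)), target)
hard_grad = torch.autograd.grad(
    hard_loss, captioner_head.weight, allow_unused=True
)[0]

# Soft path: weight the classifier embedding matrix by token probabilities.
probs = logits.softmax(dim=-1)                  # (1, seq_len, vocab_size)
soft_emb = probs @ clf_embed.weight             # (1, seq_len, hidden)
soft_loss = F.cross_entropy(clf_head(soft_emb.mean(dim=1)), target)
soft_grad = torch.autograd.grad(soft_loss, captioner_head.weight)[0]

print("hard path gradient:", hard_grad)
print("soft path gradient is nonzero:", bool(soft_grad.abs().sum() > 0))
```

In this toy setup the hard path gives no gradient for the captioner weights (`hard_grad` is `None`), while the soft path does. As far as I can tell, the Huggingface BERT models also accept an `inputs_embeds` argument in place of `input_ids`, which might make something like the soft path applicable, but I have not tried it and I do not know whether a classifier pretrained on real token embeddings behaves sensibly on such mixed embeddings.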

Does anyone have an idea of how I can compute these gradients?