Input for fine-tuning transformers in PyTorch

Hello everyone,
I was hoping someone could help me with preparing the training data for a model composed of an encoder and a decoder. I would like to know how to feed the model a text paragraph together with its corresponding handwritten image; a minimal sketch of what I have in mind follows below the model summaries.
First layer of the encoder:

FCN_Encoder(
  (init_blocks): ModuleList(
    (0): ConvBlock(
      (activation): ReLU()
      (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv3): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (norm_layer): InstanceNorm2d(32, eps=0.001, momentum=0.99, affine=False, track_running_stats=False)
      (dropout): MixDropout(
        (dropout): Dropout(p=0.5, inplace=False)
        (dropout2d): Dropout2d(p=0.25, inplace=False)
      )
    )

Decoder:

Decoder(
  (end_conv): Conv2d(512, 101, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
)
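To make the question concrete, here is a minimal sketch of how I currently imagine the data pipeline: a Dataset that pairs each paragraph image with its transcription, encodes the characters as integer class ids (the 101 output channels of end_conv suggest a vocabulary of roughly 100 characters plus a blank/padding class, but that is only my guess), and a collate function that pads images and label sequences to a common size within each batch. The sample format, charset, and padding value are assumptions on my part, not something taken from the model's codebase.

# Minimal sketch only -- sample layout, charset, and padding value are my assumptions.
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image


class HandwrittenParagraphDataset(Dataset):
    """Pairs each handwritten paragraph image with its transcription."""

    def __init__(self, samples, charset):
        # samples: list of (image_path, transcription) tuples -- assumed format
        self.samples = samples
        # Map each character to an integer class id; id 0 is reserved here for
        # blank / padding (an assumption based on the 101-channel end_conv).
        self.char_to_id = {c: i + 1 for i, c in enumerate(charset)}
        self.to_tensor = transforms.ToTensor()  # PIL image -> CxHxW float tensor in [0, 1]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        image_path, text = self.samples[idx]
        image = self.to_tensor(Image.open(image_path).convert("RGB"))  # 3 channels, as conv1 expects
        label = torch.tensor([self.char_to_id[c] for c in text], dtype=torch.long)
        return image, label


def collate_fn(batch):
    """Pad variable-sized images and variable-length label sequences in a batch."""
    images, labels = zip(*batch)
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    padded_images = torch.zeros(len(images), 3, max_h, max_w)
    for i, img in enumerate(images):
        padded_images[i, :, : img.shape[1], : img.shape[2]] = img
    label_lengths = torch.tensor([len(l) for l in labels], dtype=torch.long)
    padded_labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=0)
    return padded_images, padded_labels, label_lengths


# usage: loader = DataLoader(dataset, batch_size=4, collate_fn=collate_fn, shuffle=True)

Would a DataLoader built this way, yielding (images, labels, label_lengths) batches, be roughly the right shape of input for the encoder/decoder above, or does the model expect the image and text to be combined differently?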

I would really appreciate your help! :blush: