Hello everyone,
I was hoping someone could help me prepare the training data for a model composed of an encoder and a decoder. I would like to know how to feed the model a text (a paragraph) together with the corresponding handwritten image.
First layer of the encoder:
FCN_Encoder(
  (init_blocks): ModuleList(
    (0): ConvBlock(
      (activation): ReLU()
      (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv3): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (norm_layer): InstanceNorm2d(32, eps=0.001, momentum=0.99, affine=False, track_running_stats=False)
      (dropout): MixDropout(
        (dropout): Dropout(p=0.5, inplace=False)
        (dropout2d): Dropout2d(p=0.25, inplace=False)
      )
    )
Decoder:
Decoder(
  (end_conv): Conv2d(512, 101, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
)
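To make the question more concrete, here is a minimal sketch of how I imagine the text side could be prepared. It assumes a CTC-style setup in which the 101 output channels of end_conv are 100 characters plus one blank class; that is only my guess. The names BLANK, build_charset, encode and pad_batch are hypothetical helpers I made up for illustration, not part of the model.

```python
# Sketch only: assumes a CTC-style setup where index 0 is the blank class
# (an assumption, not something confirmed by the model printout).
BLANK = 0  # hypothetical: index reserved for the blank / padding


def build_charset(paragraphs):
    """Map every character seen in the transcriptions to an index >= 1."""
    chars = sorted(set("".join(paragraphs)))
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, char_to_idx):
    """Turn one paragraph into a list of integer labels."""
    return [char_to_idx[ch] for ch in text]


def pad_batch(label_seqs, pad_value=BLANK):
    """Pad label sequences to equal length and return the true lengths too."""
    lengths = [len(seq) for seq in label_seqs]
    max_len = max(lengths)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in label_seqs]
    return padded, lengths


# Toy transcriptions standing in for the real paragraph texts.
paragraphs = ["hello world", "hi"]
char_to_idx = build_charset(paragraphs)
labels = [encode(p, char_to_idx) for p in paragraphs]
padded, lengths = pad_batch(labels)
```

Each image would presumably be loaded as a 3-channel array of shape (3, H, W), since conv1 expects 3 input channels, and paired with its encoded paragraph. The padded labels and their lengths could then be converted to tensors for the loss. Is this the right direction?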
I would really appreciate your help!