Hello, I am trying to implement a sequence-to-sequence model for handwritten text recognition. The paper is here (SCAN), which primarily uses Convolutional Sequence to Sequence [Gehring]. I am using the IAM dataset, which has almost 7000 training image and text pairs. I have built the model almost exactly as the SCAN paper describes, but there are some issues with my implementation. The model gives very poor accuracy on the test set. After the first epoch, the loss stops improving and just fluctuates within a narrow range of values. This is my very first paper implementation in PyTorch. I am linking the Google Colab notebook here; please help me out with this issue.
Just to clarify, the images are scaled to a height of 32 while maintaining aspect ratio. The images are batched and padded, since all images have different widths. The image batch is passed through the convolutional extraction layer, which outputs features of shape [batch_size, channel, seq_len]. This output is fed to the encoder, with the sequence running along seq_len. The texts are also batched and padded with a padding token, with a start token and an end token added at the start and end of each text. During training, the decoder is initially fed the start token, and after that the decoder's own outputs are fed back into the decoder as decoder_inputs. I hope this clarifies what my approach is.
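To make the batching concrete, here is a minimal sketch of the idea rather than the exact code from my notebook. The index values `PAD_IDX`, `SOS_IDX`, `EOS_IDX` and the helper names `collate`, `teacher_forcing_pair` and `masked_loss` are placeholders I made up for illustration. It also shows teacher forcing (feeding the ground-truth tokens to the decoder during training), which is a common alternative to feeding the decoder its own outputs as I described above, and a loss that skips padded positions. Whether either of these explains the plateau is only a guess.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical vocabulary indices, chosen only for this sketch.
PAD_IDX, SOS_IDX, EOS_IDX = 0, 1, 2


def collate(samples):
    """Pad variable-width images and variable-length texts into batch tensors.

    samples: list of (image, text) where image is a float tensor of shape
    [1, 32, W_i] and text is a list of integer character ids.
    """
    images, texts = zip(*samples)
    max_w = max(img.shape[-1] for img in images)
    # F.pad takes (left, right) for the last dimension; pad on the right with zeros.
    image_batch = torch.stack(
        [F.pad(img, (0, max_w - img.shape[-1])) for img in images]
    )
    # Wrap each text with start/end tokens, then pad to the longest text.
    wrapped = [torch.tensor([SOS_IDX, *t, EOS_IDX]) for t in texts]
    text_batch = nn.utils.rnn.pad_sequence(
        wrapped, batch_first=True, padding_value=PAD_IDX
    )
    return image_batch, text_batch


def teacher_forcing_pair(text_batch):
    """Split a padded text batch into decoder inputs and shifted targets."""
    decoder_input = text_batch[:, :-1]  # starts with the start token
    target = text_batch[:, 1:]          # same sequence shifted left by one
    return decoder_input, target


def masked_loss(logits, target):
    """Cross-entropy over [B, T, V] logits that ignores padded target positions."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target.reshape(-1),
        ignore_index=PAD_IDX,
    )


# Two fake samples with different image widths and text lengths.
samples = [
    (torch.rand(1, 32, 40), [5, 6, 7]),
    (torch.rand(1, 32, 64), [8, 9]),
]
image_batch, text_batch = collate(samples)
decoder_input, target = teacher_forcing_pair(text_batch)

# Stand-in for the decoder output: random logits over a 10-symbol vocabulary.
logits = torch.randn(target.size(0), target.size(1), 10)
loss = masked_loss(logits, target)
```

With `ignore_index`, padded positions contribute nothing to the loss, so a batch with a lot of padding does not push the model toward predicting the padding token. At test time there is no ground truth, so the decoder still has to consume its own predictions step by step, starting from the start token.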
Please let me know if you have any suggestions or improvements.