Reading a word problem stored in an image?

I want to find out which models are best suited to taking in an image, performing image-to-text conversion (OCR of some kind), and then outputting the contents of the image in a “question” / “answers” format.

Each image has a question in the center and a set of single-column tables, each of which represents one answer choice. I imagine a model would have to be retrained to learn that the text in the center of the image is the question and that each column represents an answer choice. In this case there are two answer choices.

The model does not need to answer the question; it just needs to output the text.

Are there any models that would make a good starting point for research here (has this already been done)? If no existing model suits the task, what is the best approach for training one to do it?

The model would output JSON like the following, or something similar.

[
  {
    "Question": "What is the sound that the cereal Rice Krispies make? Snap, (i) ___, (ii) ___."
  },
  {
    "Answer 1": "Crackle, Pop, Crunch",
    "Answer 2": "Burp, Pop, Crackle"
  }
]
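To make the target output concrete, here is a minimal Python sketch of the final assembly step only. It assumes some earlier OCR and layout step has already produced text fragments with pixel positions and a region label; how to assign those labels is exactly the open question above. The `Fragment` class, the `"question"` / `"answer_<n>"` labels, and the `assemble` function are all hypothetical names made up for illustration, not part of any library.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Fragment:
    """One OCR'd piece of text (hypothetical structure, not from any library)."""
    text: str
    x: float      # left edge of the bounding box, in pixels
    y: float      # top edge of the bounding box, in pixels
    region: str   # "question" or "answer_<n>", assumed to come from an earlier layout step


def assemble(fragments):
    """Group fragments by region and build the question/answers structure."""
    by_region = defaultdict(list)
    for frag in fragments:
        by_region[frag.region].append(frag)

    def in_reading_order(frags):
        # Top-to-bottom, then left-to-right.
        return [f.text for f in sorted(frags, key=lambda f: (f.y, f.x))]

    question = " ".join(in_reading_order(by_region["question"]))

    # Order the answer regions by their numeric suffix: answer_1, answer_2, ...
    answer_regions = sorted(
        (name for name in by_region if name.startswith("answer_")),
        key=lambda name: int(name.split("_")[1]),
    )
    answers = {}
    for name in answer_regions:
        number = int(name.split("_")[1])
        # Each answer is a single-column table, so its rows are joined with commas.
        answers[f"Answer {number}"] = ", ".join(in_reading_order(by_region[name]))

    return [{"Question": question}, answers]


if __name__ == "__main__":
    # Made-up positions for the example image described above.
    fragments = [
        Fragment("Snap, (i) ___, (ii) ___.", 120, 60, "question"),
        Fragment("What is the sound that the cereal Rice Krispies make?", 120, 40, "question"),
        Fragment("Pop", 40, 230, "answer_1"),
        Fragment("Crackle", 40, 200, "answer_1"),
        Fragment("Crunch", 40, 260, "answer_1"),
        Fragment("Burp", 300, 200, "answer_2"),
        Fragment("Pop", 300, 230, "answer_2"),
        Fragment("Crackle", 300, 260, "answer_2"),
    ]
    print(json.dumps(assemble(fragments), indent=2))
```

Keeping this step separate from the OCR itself means that whichever model ends up recognising the text and the regions, the JSON shape stays the same.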