Hi, I’m using an LSTM for image captioning with packed inputs (see the snippet below) and I’m wondering what the proper way of testing it is. Can I just set the model to eval and feed it the test set, or should I write a custom sampling loop? I’m asking because I’m not sure how information is passed between hidden states when using packed inputs, and I was wondering whether it does “teacher forcing” (i.e. passes the target instead of the prediction), which would be a problem when testing.
While trying to understand the LSTM implementation, I found this picture:
Judging by it, I would say that during the forward pass the inputs to the memory units at time t are tag_t, hiddenstate_t, and cellstate_t, which makes me think that the prediction at time t+1 is made using the ground-truth tag at time t rather than the model’s own prediction (I would call this “teacher forcing”).
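To make the teacher-forcing interpretation concrete, here is a minimal sketch of a packed-input forward pass (hypothetical sizes and toy captions, not the actual model). The key point is that the input at every timestep is the embedded ground-truth tag, so the output at step t never sees the model’s own prediction from step t-1:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

# Hypothetical sizes for illustration only
vocab_size, embed_size, hidden_size = 10, 8, 16
embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)

# Toy ground-truth captions (already tokenized), sorted by length; 0 = pad
captions = torch.tensor([[1, 4, 5, 2],
                         [1, 3, 2, 0]])
lengths = [4, 3]

# Teacher forcing: the LSTM input at step t is the ground-truth token at
# step t, regardless of what the model predicted at step t-1.
packed = pack_padded_sequence(embed(captions), lengths, batch_first=True)
hiddens, _ = lstm(packed)

# hiddens.data holds one row per non-padded timestep (sum(lengths) rows)
logits = fc(hiddens.data)
```

So at training/eval time with packed targets, every step is conditioned on the ground truth, which is exactly the teacher-forcing behavior described above.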
When I do greedy 1-sampling, I iteratively feed the predicted tag and the previous hidden/cell states to the memory unit to predict the next tag.
Since the accuracy I achieve with sampling is lower than the accuracy of the forward pass, I want to make sure I understand the reason, which I’m led to believe is the ground-truth tag vs. predicted tag difference.
Do you think that’s not the case?
I asked because I checked the documentation but wasn’t sure what the mentioned “input_t” is when an LSTM is used in combination with packed inputs, so I wanted to make sure packed inputs are handled the way I think they are.