Can anyone explain how to train an LSTM for image captioning? Suppose, as an example, a single image has 5 captions (all sentences are of equal length). How do we train the LSTM? Do we need to train on the image 5 times, once per caption, or only once with a randomly chosen sentence?
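I don’t think there’s any real difference in practice. You can either treat each of the 5 captions as its own (image, caption) training pair, or pick one caption at random each time the image comes up. Once you train for multiple epochs, the network will almost certainly see every image-caption pair anyway.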
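To make the two options concrete, here is a rough sketch of how the training pairs could be built in each case. This is only an illustration: the `captions` dict, the image ids, and the function names are all made up for the example, and the actual LSTM training step is left out. It just shows which (image, caption) pairs the network would be fed per epoch.

```python
import random

# Hypothetical toy data: each image id maps to its 5 reference captions.
captions = {
    "img_0": [f"caption {i} for img_0" for i in range(5)],
    "img_1": [f"caption {i} for img_1" for i in range(5)],
}

def all_pairs_epoch(captions, rng):
    """Option A: every (image, caption) pair is its own training example."""
    pairs = [(img, cap) for img, caps in captions.items() for cap in caps]
    rng.shuffle(pairs)
    return pairs

def random_caption_epoch(captions, rng):
    """Option B: one randomly chosen caption per image per epoch."""
    pairs = [(img, rng.choice(caps)) for img, caps in captions.items()]
    rng.shuffle(pairs)
    return pairs

def epochs_until_full_coverage(captions, rng, max_epochs=1000):
    """Count Option B epochs until every pair has been seen at least once."""
    total = sum(len(caps) for caps in captions.values())
    seen = set()
    for epoch in range(1, max_epochs + 1):
        seen.update(random_caption_epoch(captions, rng))
        if len(seen) == total:
            return epoch
    return None  # not fully covered within max_epochs

rng = random.Random(0)
# Option A yields 10 pairs per epoch here, Option B yields 2.
print(len(all_pairs_epoch(captions, rng)), len(random_caption_epoch(captions, rng)))
# How many Option B epochs it took to see all 10 pairs (depends on the seed).
print(epochs_until_full_coverage(captions, rng))
```

The main practical difference is the length of an epoch: with 5 captions per image, Option B has 5 times fewer examples per epoch, so it needs roughly 5 times as many epochs to make the same number of updates.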