How does the network learn image captioning?

Hi, I am a bit confused about image captioning. This is my first time building an image captioning model (I have built a multimodal-learning-based image captioning model where a CNN extracts image features and an LSTM generates the sentences).
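To make the question concrete, this is roughly the architecture I mean (a minimal PyTorch sketch, not my actual code; the ResNet-18 backbone and all the dimensions are just placeholder choices):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptioningModel(nn.Module):
    """CNN encoder + LSTM decoder, as a minimal sketch."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: pretrained backbone with the classifier head removed
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])
        self.img_proj = nn.Linear(resnet.fc.in_features, embed_dim)
        # LSTM decoder over caption tokens
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image feature is fed to the LSTM as the first "token"
        feats = self.cnn(images).flatten(1)         # (B, 512)
        feats = self.img_proj(feats).unsqueeze(1)   # (B, 1, E)
        tokens = self.embed(captions)               # (B, T, E)
        inputs = torch.cat([feats, tokens], dim=1)  # prepend image feature
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                     # (B, T+1, vocab) next-word logits
```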
I have read about encoder-decoder-based, attention-based, and multimodal-learning-based image captioning models, but I am not sure how the training actually works. My guess is that in all of these models the network is trained to maximize the likelihood of the training sentences given the image features, i.e. log p(S | I) = sum over t of log p(w_t | w_1, ..., w_{t-1}, I), word by word (see the training-step sketch at the end). A general explanation would be great… thank you
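For concreteness, here is the kind of training step I have in mind under that assumption (teacher forcing with a per-word cross-entropy loss; `dataloader`, the vocabulary size, and the padding id 0 are placeholders):

```python
import torch.nn.functional as F

model = CaptioningModel(vocab_size=10000)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

for images, captions in dataloader:              # captions: (B, T) token ids
    # Teacher forcing: feed the ground-truth prefix, predict every next word.
    # The image feature at step 0 predicts the first caption token.
    logits = model(images, captions[:, :-1])     # (B, T, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),     # (B*T, vocab)
        captions.reshape(-1),                    # (B*T,)
        ignore_index=0,                          # assuming 0 is the padding id
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

If I understand correctly, minimizing this cross-entropy is the same as maximizing the log-likelihood of the training captions given the images, which is what I meant above. Is that the right picture for all three families of models?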