Standard way to combine the outputs of 2 different models

Hi, I have a dataset which consists of images and captions. I ran 2 pretrained models on the images and captions separately (ResNet-50 for the images and ALBERT for the captions).
Since both models produce the same kind of output (18 labels, using sigmoid), I am currently adding the outputs of the 2 models together and dividing by 2. I know this might be a dumb way of doing it. Can anyone give me some advice on what the standard way of combining the outputs of 2 models is? Thanks!

Do you have labels for the combined problem? If so, pick a linear weight for each output, w_1 * m_1 + w_2 * m_2, and learn w_1 and w_2. If you don't, adding them or just picking your own weights can often work as a stopgap solution.
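
To make the "learn w_1 and w_2" idea concrete, here is a minimal sketch in plain Python. It assumes you have already collected the sigmoid outputs of both models on a labelled held-out set, and it fits the two weights by ordinary least squares, which is only one possible choice (training them with a binary cross-entropy loss is another). The names `fit_ensemble_weights` and `combine` and the toy numbers are made up for illustration, and the toy data uses 3 labels instead of 18 to keep it short.

```python
def fit_ensemble_weights(probs_a, probs_b, targets):
    """Least-squares fit of w_1, w_2 so that w_1*a + w_2*b approximates targets.

    Each argument is a list of rows: one row per sample, one value per label.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for row_a, row_b, row_y in zip(probs_a, probs_b, targets):
        for a, b, y in zip(row_a, row_b, row_y):
            a11 += a * a
            a12 += a * b
            a22 += b * b
            b1 += a * y
            b2 += b * y
    # Solve the 2x2 normal equations with Cramer's rule.
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        # Both models give (nearly) proportional outputs, so the weights are
        # not identifiable; fall back to the plain average.
        return 0.5, 0.5
    w1 = (b1 * a22 - b2 * a12) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2


def combine(probs_a, probs_b, w1, w2):
    """Weighted sum of the two models' per-label outputs."""
    return [
        [w1 * a + w2 * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(probs_a, probs_b)
    ]


# Toy held-out set: 4 samples, 3 labels (hypothetical numbers).
image_probs = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4], [0.6, 0.1, 0.7], [0.2, 0.5, 0.9]]
caption_probs = [[0.7, 0.4, 0.2], [0.1, 0.9, 0.3], [0.8, 0.3, 0.5], [0.4, 0.2, 0.6]]
labels = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]]

w1, w2 = fit_ensemble_weights(image_probs, caption_probs, labels)
combined = combine(image_probs, caption_probs, w1, w2)
print("w_1 =", w1, "w_2 =", w2)
```

The fitted weights are not constrained to be positive or to sum to 1, so the combined scores can fall slightly outside [0, 1]; if you need proper probabilities, clip them or learn the weights on the logits instead. With w_1 = w_2 = 0.5 this reduces to the plain average from the question.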