Standard way to combine the outputs of 2 different models

Hi, I have a dataset that consists of images and captions. I ran 2 pretrained models on the images and captions separately (ResNet-50 for the images and ALBERT for the captions).
Since the 2 models have the same output format (18 labels, using sigmoid), I am currently adding the outputs of the 2 models together and dividing by 2. I know this might be a dumb way of doing it. Can anyone give me some advice on the standard way of combining the outputs of 2 models? Thanks!
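
To make the question concrete, the averaging described above can be sketched roughly as follows. This is a minimal, framework-free illustration and not the actual training code: the function name `average_fusion`, the optional `weight` parameter, the 0.5 decision threshold, and the example probabilities are all assumptions made up for the sketch.

```python
def average_fusion(image_probs, text_probs, weight=0.5):
    """Blend two per-label probability vectors (hypothetical helper).

    `weight` is the share given to the image model; the default 0.5
    reproduces the plain "add the outputs and divide by 2" average.
    """
    if len(image_probs) != len(text_probs):
        raise ValueError("both models must output the same number of labels")
    return [
        weight * p_img + (1.0 - weight) * p_txt
        for p_img, p_txt in zip(image_probs, text_probs)
    ]


# Made-up sigmoid outputs for 3 of the 18 labels (illustrative values only).
image_probs = [0.9, 0.2, 0.6]
text_probs = [0.7, 0.4, 0.1]

fused = average_fusion(image_probs, text_probs)

# Multi-label decision per label; the 0.5 threshold is an assumption.
predicted = [p >= 0.5 for p in fused]
```

With `weight` left at 0.5 this is the simple average from the question; setting it to another value between 0 and 1 would give a weighted average, which is one possible variation rather than something the setup above already does.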