Combining text feature vectors and image feature vectors

I have corresponding text and images, each pair labeled with a class between 1 and 9. Using transfer learning with BERT, I extracted text feature vectors (size 768), and using transfer learning on top of ResNet50, I extracted image feature vectors (size 2048).

How can I design a transfer learning network (or another reasonable network) that combines these feature vectors? Which network should be used for this?

I have a 5-fold experiment, with train and test txt files in which each line contains the corresponding feature vector.

For example, for the test set belonging to fold 0 I have two text files (one with the text features, one with the image features).
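Loading one fold could look like the sketch below (the filenames are hypothetical; adjust them to your own fold naming scheme):

```python
import numpy as np

def load_fold(text_path, img_path):
    """Load corresponding BERT and ResNet50 feature files for one fold.

    Each line of a file holds one whitespace-separated feature vector;
    row i of both files must describe the same sample.
    """
    text_feats = np.loadtxt(text_path)  # e.g. shape (n_samples, 768)
    img_feats = np.loadtxt(img_path)    # e.g. shape (n_samples, 2048)
    assert text_feats.shape[0] == img_feats.shape[0], "rows must correspond 1:1"
    return text_feats, img_feats
```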

I’m working together with Mona_Jalal, and we’ve tried two different methods so far. The first is to concatenate the two feature vectors and then add fully connected layers to make the prediction. The second is to first use fully connected layers to project the two features to the same length, then concatenate the projections and make the prediction.

I’ve included the code and ideas below and found that they reach similar accuracy. Does anyone know what the difference between the two is? And are there any other methods for feature combination? Thanks!
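For reference, the two methods can be sketched as the following PyTorch modules. The input sizes (768, 2048) and 9 classes come from the thread; the hidden and projection sizes are assumptions:

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Method 1: concatenate the raw features, then classify with FC layers."""
    def __init__(self, text_dim=768, img_dim=2048, num_classes=9):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + img_dim, 512),  # hidden size is a guess
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, text_feat, img_feat):
        return self.classifier(torch.cat([text_feat, img_feat], dim=1))

class ProjectFusion(nn.Module):
    """Method 2: project both features to the same length, then concatenate."""
    def __init__(self, text_dim=768, img_dim=2048, proj_dim=512, num_classes=9):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, proj_dim)
        self.img_proj = nn.Linear(img_dim, proj_dim)
        self.classifier = nn.Linear(2 * proj_dim, num_classes)

    def forward(self, text_feat, img_feat):
        t = torch.relu(self.text_proj(text_feat))
        i = torch.relu(self.img_proj(img_feat))
        return self.classifier(torch.cat([t, i], dim=1))
```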


Could you try to use these features separately first and check the performance of the models?
Also, could you check the feature statistics (mean, min, max) of both feature sets?

After this I would compare the “single-feature models” with the combined one and check whether the performance differs a lot, or whether the combined model might just learn from one feature while ignoring the other.
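One rough way to check whether the combined model leans on only one input is to compare the weight magnitudes of the first linear layer, split by feature block. This is a sketch assuming a model whose first layer takes the concatenated [text | image] vector:

```python
import torch
import torch.nn as nn

def input_block_norms(first_linear, text_dim=768, img_dim=2048):
    """Mean absolute weight per input block of the first FC layer.

    first_linear: an nn.Linear taking the concatenated [text | image] vector.
    A much smaller value for one block hints that the model largely ignores
    that feature set (a heuristic, not a proof).
    """
    W = first_linear.weight.detach()  # shape: (out_features, text_dim + img_dim)
    text_w = W[:, :text_dim].abs().mean().item()
    img_w = W[:, text_dim:text_dim + img_dim].abs().mean().item()
    return text_w, img_w
```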

If you have the feeling that this might be the case (I’m not sure how to check it mathematically, other than perhaps looking for low weights on the ignored feature), I would try to add some normalization layers before combining the features, so that the ranges are at least closer to each other.
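A minimal sketch of that idea, normalizing each modality separately before concatenation (LayerNorm per modality is just one option; BatchNorm1d or manual standardization with precomputed statistics would also work):

```python
import torch
import torch.nn as nn

class NormalizedFusion(nn.Module):
    """Normalize each feature set before concatenation so their ranges match."""
    def __init__(self, text_dim=768, img_dim=2048, num_classes=9):
        super().__init__()
        self.text_norm = nn.LayerNorm(text_dim)  # per-modality normalization
        self.img_norm = nn.LayerNorm(img_dim)
        self.classifier = nn.Linear(text_dim + img_dim, num_classes)

    def forward(self, text_feat, img_feat):
        fused = torch.cat(
            [self.text_norm(text_feat), self.img_norm(img_feat)], dim=1
        )
        return self.classifier(fused)
```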

Also, a good baseline using these combined features might be an XGBoost model to compare against (if you can use other methods besides neural networks).


Thank you so much for your help! I’ve actually tested the features separately. For BERT, I get an accuracy of 84.42%, and for ResNet, 65.58%. After combining them, I get 84.42% with method 1 and 84.57% with method 2. Does this mean that the images do not actually help the prediction? Thanks!


It might mean that, and the accuracies could be a pointer in that direction.
Have you checked the feature statistics of both sets?

Are you looking for something like this, say, for the image feature vectors?

[jalal@goku example]$ python 
std is:  [0.2603135  0.72011209 0.4863223  ... 0.18916588 0.30144495 0.22394807]
mean is:  [0.2792807  1.14619135 0.63354225 ... 0.2145517  0.32491094 0.2647783 ]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0.27167743 1.03813016 0.59986466 ... 0.31728339 0.06514341 0.12442386]
 [0.13293175 2.18057656 0.36396137 ... 0.27578598 0.1110082  0.05481818]
 [0.13702488 1.23713005 0.26817697 ... 0.06472887 0.00385979 0.00547262]
 [0.36607069 1.04307044 0.42863995 ... 0.36594653 0.22864909 0.10765515]
 [0.05682189 0.90699673 0.3805581  ... 0.06757039 0.04256344 0.01031811]
 [0.13570487 0.58878648 0.37142602 ... 0.32595682 0.0348542  0.06269506]]

The code is:

import numpy as np

# load the ResNet50 feature vectors (one vector per line)
infile1 = np.loadtxt('entire_dataset__resnet50_feature_vectors.txt')
std1 = np.std(infile1, 0)   # per-dimension standard deviation
mu1 = np.mean(infile1, 0)   # per-dimension mean

print("std is: ", std1)
print("mean is: ", mu1)
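Since the earlier reply asked for mean, min, and max, those can be collected the same way. A small helper (the function name is my own):

```python
import numpy as np

def feature_stats(feats):
    """Per-dimension statistics of a feature matrix (rows = samples)."""
    return {
        "mean": feats.mean(axis=0),
        "std": feats.std(axis=0),
        "min": feats.min(axis=0),
        "max": feats.max(axis=0),
    }
```

Comparing these dictionaries for the BERT and ResNet50 matrices shows directly whether the two feature sets live on very different scales.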



Yes, exactly.
If the statistics differ a lot for both feature sets, the parameters might not be able to get useful information from both inputs.
It’s similar to passing a normalized image after training on unnormalized ones (in [0, 255]).
The normalized image might just “look” like a black image to the model.

I am facing a similar problem. Can you please suggest any examples of normalization layers/code that I can use so that the features from both models are in a comparable range?

Unfortunately, I don’t know which normalization layer would work best (and you should certainly try different approaches), or whether even a manual scaling would work fine.
