Trouble training a model with several inputs: the model “fixates” on the most “simple” feature. Any ideas?

I have a classification problem (similar to an NLP classification problem) with several different inputs, for the sake of simplicity I’m trying to handle two inputs right now.

The first input is a string of values with constant length across all samples (it’s a nucleotide sequence, which can be viewed as a sentence in some language; consider it the “complex” input).
The second input is a short list of int values describing the sample (in the NLP analogy these could be values describing when the sentence was written, the name of the author, etc. This is the “simple” input).

The model - I tried different architectures, but the problem persisted in all of them. In the latest architecture I used a VDCNN (very deep CNN, which has performed very well on NLP problems) for the first input. For the second input I used a simple linear layer. Then I concatenated the outputs and fed them to a final linear layer.
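A minimal sketch of this kind of two-branch setup (the layer sizes and the small conv stack standing in for the VDCNN are illustrative, not the actual architecture):

```python
import torch
import torch.nn as nn

class TwoInputNet(nn.Module):
    def __init__(self, seq_feat_dim=128, meta_dim=4, meta_feat_dim=16, n_classes=2):
        super().__init__()
        # Stand-in for the VDCNN over the one-hot nucleotide sequence
        self.seq_branch = nn.Sequential(
            nn.Conv1d(4, 64, kernel_size=3, padding=1),  # 4 channels: A/C/G/T
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
            nn.Flatten(),
            nn.Linear(64, seq_feat_dim),
        )
        # Simple linear layer for the metadata ("simple") input
        self.meta_branch = nn.Linear(meta_dim, meta_feat_dim)
        # Final linear layer on the concatenated features
        self.head = nn.Linear(seq_feat_dim + meta_feat_dim, n_classes)

    def forward(self, seq, meta):
        f1 = self.seq_branch(seq)                # (B, seq_feat_dim)
        f2 = torch.relu(self.meta_branch(meta))  # (B, meta_feat_dim)
        return self.head(torch.cat([f1, f2], dim=1))

model = TwoInputNet()
out = model(torch.randn(8, 4, 100), torch.randn(8, 4))
print(out.shape)  # torch.Size([8, 2])
```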

The problem - In all of the different architectures the model trained very fast up to 70% accuracy (and a certain loss value) but couldn’t get past that value. Then I tried training the model only on the first input by making the second input constant - the model trained slowly to 60% accuracy; the important fact is that it trained. I also tried training the model only on the second input, and it trained very fast to 70% accuracy (much as it did with both inputs).

The two inputs are definitely not redundant, meaning both inputs should contribute additively to the accuracy of the model.

What I would expect from the model is to train fast to 70% accuracy (learning features from the second input) and then keep improving slowly as it learns from the first input.
But it seems that the model learns the most “simple” features from the second input, then completely ignores the first input and learns nothing from it.

Any ideas how to combat this?

Could you try to apply Dropout on the concatenated feature layer?
Maybe this forces the model to “look” at both features.

Also, have you checked the value ranges of both inputs to the final layer?
Maybe the range of the VDCNN output is quite small compared to the second input, so it looks like noise?
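For example, a quick way to compare the scales (with made-up tensors standing in for the real branch outputs):

```python
import torch

# Made-up tensors standing in for the two branches' outputs
vdcnn_feats = torch.randn(256, 128) * 1e-3  # suspected small-scale branch
meta_feats = torch.randn(256, 16) * 5.0     # suspected large-scale branch

# Print simple magnitude statistics for each branch
for name, t in [("vdcnn", vdcnn_feats), ("meta", meta_feats)]:
    print(name,
          "mean|x|=%.5f" % t.abs().mean().item(),
          "std=%.5f" % t.std().item(),
          "max=%.5f" % t.max().item())
```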

Thanks for your suggestions.
The dropout layer is a great idea - especially because I had thought about it myself, so now I know I’m not the only one thinking that way. I tried applying dropout to the second input with different dropout values (0.1-0.9), but it didn’t help.

It’s also interesting to check the value range of the VDCNN vs the linear layer output (from the second input). I will check it today.

I also thought about a different solution - train networks separately for the two inputs and then connect them in a bigger network by concatenating the outputs and feeding them to a linear layer, while training only that last layer.

I would apply Dropout on the whole concatenated feature vector so that both inputs are mixed together.
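Roughly like this (sizes made up):

```python
import torch
import torch.nn as nn

f1 = torch.randn(8, 128)  # e.g. VDCNN branch features
f2 = torch.randn(8, 16)   # e.g. metadata branch features

drop = nn.Dropout(p=0.5)
# Dropout applied after the concat, so units from both branches
# are randomly zeroed together rather than per branch
combined = drop(torch.cat([f1, f2], dim=1))
print(combined.shape)  # torch.Size([8, 144])
```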

Pre-training the models also sounds interesting.
However, it wouldn’t really explain the current issue.

Looking forward to hearing about the value ranges!

Thanks for your interest.
I tried applying dropout to the concatenated input as well, and separately to each input with different dropout values; it still didn’t fix the issue.
Regarding the values - they indeed were high for the simple feature. To combat this I tried lowering the values and applying some normalization to the raw values. Still, the model didn’t get past 70% accuracy.

Finally I tried the method of training separate models for each input and merging them later: I connected the two models by concatenating their outputs and feeding them to a linear layer.
After a short period of training the model got to 70% accuracy again and didn’t improve. I checked the outputs of the two models, and it was very noticeable that the first output (from the complex features) had low values while the values from the “simple features” model were high, meaning the combined model clearly “fixated” on the simple features.

Any additional ideas?
I still think the two features are not redundant and both should contribute to the accuracy.

How did you apply the normalization on the raw values?
By raw values do you mean the input or the features from the sub-models?

Regarding the pre-training strategy: did you save the sub-model outputs, or freeze the sub-models while training the second-stage model?
Could you provide the sub-model features as Tensors so that I could have a look?

How was the accuracy of both sub-models separately?

Sorry for the delay.
I tried normalizing the raw input itself (not batch norm, but scaling to bring all values to a similar range).
The first (complex) input is a one-hot vector (many zeros and a few ones), but the second input is a bunch of floats with a much larger range; I tried normalizing it by simply bringing the values to between 0 and 1, or between -1 and 1.
I also applied batch normalization to the features from the sub-models and between layers in the model.
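The kind of per-feature min-max scaling described above can be sketched like this (toy values):

```python
import torch

x = torch.tensor([[3.0, 100.0],
                  [7.0, 250.0],
                  [5.0, 400.0]])  # toy "simple" input, one row per sample

# Per-feature min-max scaling to [0, 1]
x_min = x.min(dim=0).values
x_max = x.max(dim=0).values
x01 = (x - x_min) / (x_max - x_min)

# Or rescale to [-1, 1]
x11 = x01 * 2 - 1
print(x01.min().item(), x01.max().item())  # 0.0 1.0
```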

I tried training the combined model and the sub-models with the same learning rate, tried lowering the learning rate for the sub-models, and tried completely freezing the sub-models.
My colleague suggested pre-training a model on the complex input, then “freezing” its weights, adding the information from the second “simple” input, and training the network very slowly.
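Freezing a sub-model while training only the layers on top can be done like this (the modules here are small stand-ins for the real sub-models):

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in for a pre-trained sub-model (the real one would be the VDCNN etc.)
submodel = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 8))

# Freeze it completely
for p in submodel.parameters():
    p.requires_grad = False
submodel.eval()  # also fixes dropout / batch-norm statistics

head = nn.Linear(8, 2)
opt = optim.Adam(head.parameters(), lr=1e-3)  # only the head is trained
```

For the “lower learning rate for the sub-models” variant, `torch.optim` optimizers also accept per-parameter groups, each with its own `lr`, so the sub-models can be trained much more slowly than the head instead of being frozen outright.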

I did some additional testing with (somewhat different) sub-models and got these numbers.
We have one class whose predictions we are interested in, so the accuracy is measured for that class:
The first model, on the complex input, classifies 46% of the samples correctly.
The second model, on the simple input, classifies 54% correctly.
Both models classify 37% of the samples correctly.
Meaning: 9% of the samples are classified correctly by the first model but not the second,
and 17% are classified correctly by the second model but not the first.
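These overlap numbers can be reproduced from per-model correctness masks; a small sketch with synthetic masks constructed to match the reported percentages:

```python
import torch

n = 1000
# Synthetic correctness masks matching the reported numbers:
# model 1 correct on 46%, model 2 on 54%, both on 37%
correct1 = torch.zeros(n, dtype=torch.bool)
correct2 = torch.zeros(n, dtype=torch.bool)
correct1[:460] = True
correct2[90:630] = True

both = (correct1 & correct2).float().mean().item()    # ≈ 0.37
only1 = (correct1 & ~correct2).float().mean().item()  # ≈ 0.09
only2 = (correct2 & ~correct1).float().mean().item()  # ≈ 0.17
# Union = 63%: the ceiling for a combiner that always picks the right sub-model
either = (correct1 | correct2).float().mean().item()
print(both, only1, only2, either)
```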

Regarding the inputs - I tried taking the outputs from the softmax after both models (didn’t work), then I tried removing the softmax and got these values:
model 1 on complex input
Variable containing:
4.5105e-01 5.4895e-01
6.3620e-01 3.6380e-01
9.7610e-01 2.3901e-02
9.2572e-01 7.4278e-02
2.9720e-01 7.0280e-01
6.0810e-01 3.9190e-01
7.6838e-01 2.3162e-01
3.7808e-01 6.2192e-01
5.3675e-01 4.6325e-01

model 2 on simple input
Variable containing:
0.4752 -0.8468
-0.9316 0.6881
-0.4031 0.2606
0.2582 -0.1270
2.4042 -2.7847
-0.8459 0.6637
-0.3613 0.3322
-0.4756 0.5486
0.6900 -0.7750

After that I tried deleting the last FC layers of both sub-models and concatenating the outputs; the outputs looked like this:
model 1 on complex input
Variable containing:
4.7969e-04 4.7969e-04 4.7969e-04 … 4.7969e-04 4.7969e-04 4.7969e-04
4.8490e-04 4.8490e-04 4.8490e-04 … 4.8490e-04 4.8490e-04 4.8490e-04
3.1762e-04 3.1762e-04 3.3527e-04 … 3.1762e-04 3.1762e-04 3.1762e-04
… ⋱ …
4.7948e-04 4.7948e-04 4.7948e-04 … 4.7948e-04 4.7948e-04 4.7948e-04
2.6548e-04 2.6548e-04 2.6548e-04 … 2.6548e-04 2.6548e-04 2.6548e-04
3.1191e-04 3.1191e-04 3.1191e-04 … 3.1191e-04 3.1191e-04 3.1191e-04
[torch.cuda.FloatTensor of size 256x2048 (GPU 0)]

model 2 on simple input
Variable containing:
1.00000e-02 *
0.0404 0.0404 0.0404 … 0.0431 0.0404 0.0521
0.0471 0.0418 0.0418 … 0.0418 0.0428 0.0675
0.0381 0.0381 0.0849 … 0.0381 0.0381 0.0425
… ⋱ …
0.0371 0.0649 0.0371 … 0.0371 0.0594 0.0380
0.0542 0.0415 0.0415 … 0.0666 0.0577 0.0422
0.0189 0.0255 0.0255 … 0.0189 0.0683 0.0189
[torch.cuda.FloatTensor of size 256x2048 (GPU 0)]

I thought the linear layers I put after concatenating the two sub-model outputs would handle the different value scales of the two sub-models.
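One option (an assumption on my part, not something already tried above) is to normalize each branch explicitly before the concat, e.g. with a separate BatchNorm1d per branch, instead of relying on the later linear layers to absorb the scale gap:

```python
import torch
import torch.nn as nn

# One BatchNorm per branch, applied before concatenation
# (2048 matches the feature size in the dumps above)
bn1 = nn.BatchNorm1d(2048)  # complex-input branch
bn2 = nn.BatchNorm1d(2048)  # simple-input branch

f1 = torch.randn(256, 2048) * 1e-2  # small-scale branch
f2 = torch.randn(256, 2048) * 5.0   # large-scale branch

# After normalization both halves have roughly unit scale
combined = torch.cat([bn1(f1), bn2(f2)], dim=1)
print(combined.std())
```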

Thanks for the information!

How do you process the complex input? Are you using an Embedding layer?

The concatenated output looks a bit redundant to me.
For example, for model 1 and the complex input, there seem to be a lot of repetitions within each sample.

For the complex input I used CNNs followed by an RNN and finally a few linear layers (like a few other research papers in this field). The output you see here:

model 1 on complex input
Variable containing:
4.7969e-04 4.7969e-04 4.7969e-04 … 4.7969e-04 4.7969e-04 4.7969e-04
4.8490e-04 4.8490e-04 4.8490e-04 … 4.8490e-04 4.8490e-04 4.8490e-04

is after CNN layers, RNN and one FC layer.
I tried using an embedding, but the results were worse for some reason (it didn’t train, or trained really slowly). Do you think I should give embedding another chance?

Now that you’ve mentioned it, it does look redundant - is there something I can learn from this? (By the way, this is before the concatenation: I concatenate this output with the one I pasted after it and then pass it to two additional FC layers.)

To conclude, I have two models that predict the same thing based on two different features; one model is better on some samples and the other is better on others.
So either I need to find a way to train one big model on all the features that won’t “fixate” on one feature, or I need to connect the two models. I can connect them and hope deep learning will successfully combine the accuracy of each model, or manually decide which model to use (I might try training another model simply to decide which model to use for each sample; maybe that will give me some interesting results).
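That “model that decides which model to use” idea can be sketched as a simple learned gate (everything here is hypothetical: the stand-in sub-models, the choice of gate input, the sizes):

```python
import torch
import torch.nn as nn

class GatedEnsemble(nn.Module):
    """Weights the two (possibly frozen) sub-models with a learned gate."""
    def __init__(self, model1, model2, meta_dim):
        super().__init__()
        self.model1 = model1  # complex-input sub-model
        self.model2 = model2  # simple-input sub-model
        # The gate here looks only at the simple input; that is an arbitrary choice
        self.gate = nn.Sequential(nn.Linear(meta_dim, 1), nn.Sigmoid())

    def forward(self, seq, meta):
        g = self.gate(meta)  # (B, 1), each entry in [0, 1]
        return g * self.model1(seq) + (1 - g) * self.model2(meta)

# Stand-ins for the real sub-models, just to show the shapes
m1 = nn.Linear(100, 2)
m2 = nn.Linear(4, 2)
ens = GatedEnsemble(m1, m2, meta_dim=4)
out = ens(torch.randn(8, 100), torch.randn(8, 4))
print(out.shape)  # torch.Size([8, 2])
```

A soft gate like this trains end-to-end with the usual classification loss, as opposed to a hard per-sample model selection, which is not differentiable.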

Any additional suggestions or opinions are welcome.