Concatenating data with the output of BERT

I have model A, which is BERT, and model B, which is also BERT. Each model has different input features.

I removed the classifier layer from each model before constructing a parent model that concatenates three distinct features:

- The output of model A (text)

- The output of model B (POS)

- Other features (subjectivity)

Then I feed the concatenated vector into a fully connected layer and perform the classification.

The architecture looks like this:

The problem is that the loss is too high (218) and the accuracy has not improved compared to the results before adding the new features, which are not input to any model…

I think I should do normalization, but I don't know how…
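One common approach is to normalize the extra features before concatenating them with the BERT outputs, for example with a `LayerNorm` (or by standardizing them to zero mean and unit variance). A minimal sketch, assuming `x5` and `x6` are float tensors of shape `(batch, num_features)` with an assumed feature size of 512 (adjust to your actual sizes):

```python
import torch

# Hypothetical example: normalize the raw extra features with LayerNorm
# so they are on a scale comparable to the BERT outputs.
norm5 = torch.nn.LayerNorm(512)  # assumed feature size, not from the original post
norm6 = torch.nn.LayerNorm(512)

x5 = torch.randn(4, 512) * 100 + 50   # raw features on a large, shifted scale
x6 = torch.randn(4, 512) * 10

x5n, x6n = norm5(x5), norm6(x6)       # each row now has ~zero mean, unit variance
x = torch.cat((x5n, x6n), dim=1)      # safe to concatenate with BERT outputs
print(x.shape)                        # torch.Size([4, 1024])
```

These `LayerNorm` modules would live inside the parent model's `__init__` and be applied in `forward` before the `torch.cat` call.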

This is the parent model:

class ParentModel(torch.nn.Module):
    def __init__(self, modelA, modelB):
        super(ParentModel, self).__init__()
        self.modelA = modelA
        self.modelB = modelB
        self.fc1 = torch.nn.Linear(2560, 1536)
        self.classifier = torch.nn.Linear(1536, 3)

    def forward(self, x1, x2, x3, x4, x5, x6):
        x11 = self.modelA(x1, x2)
        x22 = self.modelB(x3, x4)
        # concatenate both BERT outputs with the extra features
        x = torch.cat((x11, x22, x5, x6), dim=1)
        x = torch.nn.ReLU()(self.fc1(x))
        x = self.classifier(x)
        return x

ids1 = batch1['ids'].to(device, dtype=torch.long)
mask1 = batch1['mask'].to(device, dtype=torch.long)
targets1 = batch1['targets'].to(device, dtype=torch.long)

ids2 = batch2['ids'].to(device, dtype=torch.long)
mask2 = batch2['mask'].to(device, dtype=torch.long)

ids3 = batch3['ids'].to(device, dtype=torch.long)
mask3 = batch3['mask'].to(device, dtype=torch.long)

outputs = model(ids1, mask1, ids2, mask2, ids3, mask3)

Please advise in case the problem is not related to normalization.

These are the results:

The results for concatenating just the outputs of model A and model B are:


Maybe you should share some more information, such as the training parameters: learning rate, epochs, optimizer, dropout.

Are you using a BERT model from Hugging Face?

What are the input features?

Try to expand the question a bit so we can understand it better :wink:

Anyway, I would investigate the learning rate. If it is too high, the model will not converge; try lowering it and adding some dropout and weight normalization.
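As a concrete sketch of that suggestion (the hyperparameter values here are assumptions, not taken from the original post), lowering the learning rate and adding dropout before the classifier could look like:

```python
import torch

# Hypothetical head matching the post's dimensions (2560 -> 1536 -> 3),
# with dropout added before the classifier and a lower learning rate.
model = torch.nn.Sequential(
    torch.nn.Linear(2560, 1536),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.3),          # assumed dropout rate, tune it
    torch.nn.Linear(1536, 3),
)

# 2e-5 is a common learning rate for fine-tuning BERT-sized models.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

x = torch.randn(4, 2560)              # stand-in for the concatenated features
logits = model(x)
print(logits.shape)                   # torch.Size([4, 3])
```

In the actual `ParentModel`, the dropout would go between `self.fc1` and `self.classifier` in `forward`.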