# Model outputs converge to single value

I’ve created a model that takes a (tokenized) sequence of length n as input and predicts a probability in [0, 1] for each of the n tokens, e.g. [0.0, 0.0, 0.3, 0.72, …, 0.0]. The model takes the output of a pretrained BERT model (of shape [batch_sz, 1024, sequence_len]) and feeds it into a stack of fully connected layers, as follows:

``````python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self, bert_model):
        super(MyModel, self).__init__()
        self.bert_model = bert_model
        self.linear_1 = nn.Linear(1024, 512)
        self.linear_2 = nn.Linear(512, 128)
        self.linear_3 = nn.Linear(128, 64)
        self.linear_4 = nn.Linear(64, 1)

    def forward(self, x):
        # x: BERT hidden states with the 1024-dim hidden axis last
        x = F.relu(self.linear_1(x))
        x = F.relu(self.linear_2(x))
        x = F.relu(self.linear_3(x))
        x = torch.flatten(torch.sigmoid(self.linear_4(x)))
        return x
``````
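One shape detail worth double-checking here: `nn.Linear` operates on the last dimension of its input, so hidden states of shape [batch_sz, 1024, sequence_len] (as described above) would need the 1024-dim hidden axis moved last before the first linear layer. A minimal sketch of what that looks like (the batch size 2 and sequence length 7 are arbitrary illustration values):

``````python
import torch
import torch.nn as nn

linear_1 = nn.Linear(1024, 512)

# nn.Linear applies to the last dimension, so transpose the hidden axis
# (dim 1) with the sequence axis (dim 2) before the layer is applied.
hidden = torch.randn(2, 1024, 7)            # [batch_sz, 1024, sequence_len]
out = linear_1(hidden.transpose(1, 2))      # -> [batch_sz, sequence_len, 512]
``````

If the tensor already has shape [batch_sz, sequence_len, 1024], no transpose is needed.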

What I’m finding, however, is that when evaluating, all outputs for a sequence are the same value, with only very slight differences. The model does in fact learn, and it does so rather quickly, but it also quickly converges to this type of prediction:

This trend continues long after step 175 shown in the image above. When evaluating (every 5000 steps) on a test dataset of 1000 unique sequences of varying lengths (around 500 on average), we see that the model has learned to predict a single value for every position in the output (e.g. [0.07, 0.07, …, 0.07]) for every sequence evaluated:

``````
training_steps  maximum_prediction  minimum_prediction  difference
5000            0.078548            0.078548            8.94E-08
10000           0.079725            0.079725            7.45E-08
15000           0.082846            0.082846            7.45E-08
20000           0.082651            0.082651            7.45E-08
25000           0.067803            0.067803            3.73E-08
``````
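For reference, the spread statistics in that table can be gathered with a small helper along these lines (`prediction_spread` is a hypothetical name, assuming predictions are collected as a list of 1-D tensors):

``````python
import torch

def prediction_spread(preds):
    """Max, min, and max-min difference across all predicted positions."""
    flat = torch.cat(preds)
    return flat.max().item(), flat.min().item(), (flat.max() - flat.min()).item()

# e.g. two near-constant prediction vectors from different sequences
mx, mn, diff = prediction_spread([torch.tensor([0.0785480, 0.0785481]),
                                  torch.tensor([0.0785480])])
``````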

Examining the predictions in the training loop, I can see that this phenomenon occurs on the training dataset as well. It’s not always the case: predictions early in training do in fact differ (by up to 50% in some cases), but they converge to being nearly identical as training goes on. This leads me to believe that the data processing is correct and that the issue is either in the way I’m using the loss function or in the way the model is set up. The training loop is written as follows:

``````python
import numpy as np
import torch
import torch.nn as nn
from torch.optim import AdamW
from transformers import BertModel

protbert_model = BertModel.from_pretrained(model_name)
model = MyModel(protbert_model)
model.to(device)

optimizer = AdamW(model.parameters(), lr=0.00001)
loss_fct = nn.MSELoss(reduction="sum")

model.train()
labels = torch.from_numpy(np.asarray(batch["labels"], dtype=np.float32)).cuda()
inputs = batch["input_ids"].cuda()

# mask to exclude padding positions (labeled -100.0); zeroing the masked
# outputs and labels shouldn't affect the loss value due to reduction="sum"
loss_mask = torch.where(labels == -100.0, 0, 1).cuda()

# mask to exclude additional tokens we don't want in our calculation;
# again, zeroed positions contribute nothing with reduction="sum"
domains = np.asarray(batch["domains"])
domains_mask = np.where(domains == -1, 0, 1)