I’ve encountered some strange behavior in my model’s training outputs. The steps the optimizer takes do not seem to help convergence, and in many cases they move the output in the opposite direction from what the loss would call for. As a test, I set every label in the dataset to 0.86 (an arbitrary value) to see whether the model can at least learn to predict a constant. With MSELoss, these are the outputs at each training step:

```
Loss | Model Output | Label
tensor(0.3838, grad_fn=<MseLossBackward0>) tensor([[0.2405]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.2788, grad_fn=<MseLossBackward0>) tensor([[0.3320]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.1464, grad_fn=<MseLossBackward0>) tensor([[0.4774]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.0539, grad_fn=<MseLossBackward0>) tensor([[0.6278]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.0121, grad_fn=<MseLossBackward0>) tensor([[0.7500]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.0006, grad_fn=<MseLossBackward0>) tensor([[0.8354]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.0009, grad_fn=<MseLossBackward0>) tensor([[0.8908]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.0043, grad_fn=<MseLossBackward0>) tensor([[0.9257]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.0077, grad_fn=<MseLossBackward0>) tensor([[0.9475]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.0104, grad_fn=<MseLossBackward0>) tensor([[0.9619]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.0124, grad_fn=<MseLossBackward0>) tensor([[0.9713]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.0138, grad_fn=<MseLossBackward0>) tensor([[0.9776]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.0149, grad_fn=<MseLossBackward0>) tensor([[0.9820]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
```
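
As a sanity check on the log itself, here is a small sketch in plain Python. The values are copied from the printout above, and the `2e-4` tolerance is my own choice to absorb rounding in the printed numbers. It checks that each logged loss matches the squared error between output and label, and finds the first step at which the output exceeds the label.

```python
# (model output, logged loss) pairs copied from the MSELoss log above.
label = 0.86
log = [
    (0.2405, 0.3838), (0.3320, 0.2788), (0.4774, 0.1464), (0.6278, 0.0539),
    (0.7500, 0.0121), (0.8354, 0.0006), (0.8908, 0.0009), (0.9257, 0.0043),
    (0.9475, 0.0077), (0.9619, 0.0104), (0.9713, 0.0124), (0.9776, 0.0138),
    (0.9820, 0.0149),
]

# Entries whose logged loss does not match (output - label)^2 within tolerance.
mismatches = [(out, loss) for out, loss in log
              if abs((out - label) ** 2 - loss) > 2e-4]

# 1-based index of the first step whose output is above the label.
first_over = next(i for i, (out, _) in enumerate(log, start=1) if out > label)

print("mismatches:", mismatches)
print("first step above the label:", first_over)
```

So the loss values themselves look consistent with the outputs; the odd part is that the output keeps rising after it has passed the label.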

The same thing happens when I change the loss to L1Loss:

```
Loss | Model Output | Label
tensor(0.8063, grad_fn=<SumBackward0>) tensor([[0.0537]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.7222, grad_fn=<SumBackward0>) tensor([[0.1378]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.5698, grad_fn=<SumBackward0>) tensor([[0.2902]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.3777, grad_fn=<SumBackward0>) tensor([[0.4823]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.1891, grad_fn=<SumBackward0>) tensor([[0.6709]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.0419, grad_fn=<SumBackward0>) tensor([[0.8181]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.0509, grad_fn=<SumBackward0>) tensor([[0.9109]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.0926, grad_fn=<SumBackward0>) tensor([[0.9526]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.1121, grad_fn=<SumBackward0>) tensor([[0.9721]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
tensor(0.1224, grad_fn=<SumBackward0>) tensor([[0.9824]], grad_fn=<MeanBackward1>) tensor([[0.8600]])
```

Here, once the output has passed the label, the steps the optimizer takes no longer lower the loss; the output instead keeps moving toward 1.
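
To make the drift concrete, this sketch (plain Python, outputs copied from the L1Loss log above) computes the change in output between consecutive steps and the signed error at each step. The variable names are only for illustration.

```python
label = 0.86
# Model outputs copied from the L1Loss log above, in order.
outputs = [0.0537, 0.1378, 0.2902, 0.4823, 0.6709,
           0.8181, 0.9109, 0.9526, 0.9721, 0.9824]

# Change in output from one step to the next.
deltas = [round(b - a, 4) for a, b in zip(outputs, outputs[1:])]
# Signed error: negative below the label, positive above it.
errors = [round(o - label, 4) for o in outputs]

for step, (out, err) in enumerate(zip(outputs, errors), start=1):
    print(f"step {step:2d}  output={out:.4f}  error={err:+.4f}")
print("deltas:", deltas)
```

Every delta is positive, including the ones after the error changes sign between steps 6 and 7, so the output never turns back toward the label in this log; the steps only get smaller as the output approaches 1.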

My model is defined as follows:

```
import torch
import torch.nn as nn


class DualBertForClassification(nn.Module):
    def __init__(self, bert_model_a, bert_model_b):
        super().__init__()
        self.bert_model_wt = bert_model_a
        self.bert_model_mutant = bert_model_b
        # Regression head applied to every token position.
        self.layer_1 = nn.Linear(1024, 512)
        self.layer_2 = nn.Linear(512, 128)
        self.layer_3 = nn.Linear(128, 16)
        self.layer_4 = nn.Linear(16, 1)

    def forward(self, x):
        x_a = x[0]
        x_b = x[1]
        # Concatenate the two encoders' outputs along the sequence dimension.
        x = torch.cat(
            (
                self.bert_model_wt(**x_a).last_hidden_state,     # [batch_sz, sequence_sz, 1024]
                self.bert_model_mutant(**x_b).last_hidden_state  # [batch_sz, sequence_sz, 1024]
            ),
            1
        )                                    # [batch_sz, 2*sequence_sz, 1024]
        x = torch.tanh(self.layer_1(x))      # [batch_sz, 2*sequence_sz, 512]
        x = torch.tanh(self.layer_2(x))      # [batch_sz, 2*sequence_sz, 128]
        x = torch.tanh(self.layer_3(x))      # [batch_sz, 2*sequence_sz, 16]
        x = torch.tanh(self.layer_4(x))      # [batch_sz, 2*sequence_sz, 1]
        x = torch.mean(x, dim=1)             # [batch_sz, 1]
        return x
```
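
To double-check the shapes in the comments, here is a minimal sketch of just the head, with a random tensor standing in for the concatenated BERT outputs. The `Head` class, the batch and sequence sizes, and the random input are hypothetical stand-ins, not part of my actual code.

```python
import torch
import torch.nn as nn


class Head(nn.Module):
    """Only the four linear layers from the model; the BERT encoders are left out."""

    def __init__(self):
        super().__init__()
        self.layer_1 = nn.Linear(1024, 512)
        self.layer_2 = nn.Linear(512, 128)
        self.layer_3 = nn.Linear(128, 16)
        self.layer_4 = nn.Linear(16, 1)

    def forward(self, hidden):
        x = torch.tanh(self.layer_1(hidden))
        x = torch.tanh(self.layer_2(x))
        x = torch.tanh(self.layer_3(x))
        x = torch.tanh(self.layer_4(x))
        return torch.mean(x, dim=1)  # average over the sequence dimension


torch.manual_seed(0)
batch_sz, sequence_sz = 2, 5  # arbitrary small sizes for the check
# Random stand-in for the two concatenated last_hidden_state tensors.
hidden = torch.randn(batch_sz, 2 * sequence_sz, 1024)
out = Head()(hidden)
print(out.shape)
```

Because the last operation before the mean is a tanh, every output is an average of values between -1 and 1, so the model output cannot exceed 1 in magnitude.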

bert_model_a and bert_model_b are identical pretrained BERT models, whose output sizes are shown in the comments. Here is the training loop that produced the printed output above:

```
import torch
import torch.nn as nn
from transformers import BertModel  # assuming the Hugging Face transformers library

# model_name, wt_inputs, and alt_inputs are defined elsewhere.
model1 = BertModel.from_pretrained(model_name)
model2 = BertModel.from_pretrained(model_name)
model = DualBertForClassification(model1, model2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
loss_fct = nn.L1Loss(reduction="sum")  # also tried nn.MSELoss(reduction="sum")
model.train()
for index, (input_a, input_b) in enumerate(zip(wt_inputs, alt_inputs)):
    # Constant label for every sample, used only for this test.
    label_val = torch.tensor([[0.86]], dtype=torch.float32)
    model_output = model((input_a, input_b))
    loss = loss_fct(model_output, label_val)
    print(loss, model_output, label_val)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
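
For reference, this is a hand-written sketch of the update I understand `torch.optim.SGD` to apply with `momentum=0.9`, assuming the defaults `dampening=0` and `nesterov=False`. It runs on a hypothetical one-parameter model `tanh(w)` with a squared-error loss, not on the real network, so it only shows what this update rule does in a toy setting with no autograd or BERT involved.

```python
import math

label = 0.86
lr, momentum = 0.05, 0.9   # same hyperparameters as the training loop
w, velocity = 0.0, 0.0     # hypothetical scalar parameter and its momentum buffer
outputs = []

for step in range(30):
    y = math.tanh(w)                        # toy "model output"
    outputs.append(y)
    grad = 2 * (y - label) * (1 - y * y)    # d/dw of (tanh(w) - label)^2
    velocity = momentum * velocity + grad   # momentum buffer update
    w -= lr * velocity                      # parameter update
    print(f"{step:2d}  output={y:.4f}  loss={(y - label) ** 2:.4f}")
```

In this toy the output also passes the label and keeps rising for a while, because the accumulated velocity is much larger than the small gradients near the saturated end of tanh. Whether the same mechanism explains the behavior of the real model is not something this sketch establishes.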

My guess is that this could be an issue with how autograd handles the two input BERT models, but I still cannot come up with an explanation for the loss growing over time. Any help is greatly appreciated, thank you.