I have a relatively simple model. I use a base BERT model with 10 linear classifier heads on top of it. It looks like this:
```python
import torch.nn as nn

class BERT(nn.Module):
    def __init__(self, bert):
        super(BERT, self).__init__()
        self.bert = bert
        self.fc1 = nn.Linear(768, 768)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.1)
        self.dropout_head = nn.Dropout(p=0.25)
        self.classifier_heads = nn.ModuleList(
            [nn.Linear(768, dataset.number_of_classes()) for i in range(10)]
        )
        self.output = []

    def forward(self, **kwargs):
        cls_hs = self.bert(**kwargs)
        hidden_state = cls_hs.last_hidden_state
        pooler = hidden_state[:, 0]  # representation of the [CLS] token
        x = self.fc1(pooler)
        x = self.relu(x)
        x = self.dropout(x)
        self.output = []
        for layer in self.classifier_heads:
            x_out = layer(x)
            x_out = self.dropout_head(x_out)
            self.output.append(x_out)
        return self.output
```
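For context, this is roughly how I use it (a minimal sketch, assuming Hugging Face transformers, that `dataset.number_of_classes()` is defined in scope, and with a model name and sentence that are just my own examples):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
model = BERT(bert)

batch = tokenizer(["an example sentence"], return_tensors="pt")
preds = model(**batch)  # list of 10 tensors, one per classifier head
```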
The model returns a list of length 10; each item is a tensor of logits of shape (batch size, number of classes), here (768, 1000). I want to update the heads of the model dynamically. My hope is that the heads will specialize in parts of the data (e.g. outliers, or text about images).
My approach was to stack the outputs, and to stack the labels to match, creating a tensor of shape batch size × number of classes × number of heads. After this, I apply CrossEntropyLoss WITHOUT reduction to get the loss per head, resulting in a tensor of shape batch size × number of heads.
```python
CEloss = nn.CrossEntropyLoss(reduction='none')

# During the training loop...
preds_stacked = torch.stack(preds, dim=-1)
labels_stacked = torch.stack([labels for i in range(len(model.classifier_heads))], dim=-1)
losses = CEloss(preds_stacked, labels_stacked)
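To double-check the shapes involved, here is a minimal sanity check with dummy tensors (the batch size and class count are my own example numbers):

```python
import torch
import torch.nn as nn

batch_size, n_classes, n_heads = 4, 1000, 10
preds = [torch.randn(batch_size, n_classes) for _ in range(n_heads)]
labels = torch.randint(0, n_classes, (batch_size,))

CEloss = nn.CrossEntropyLoss(reduction='none')
preds_stacked = torch.stack(preds, dim=-1)                # (4, 1000, 10)
labels_stacked = torch.stack([labels] * n_heads, dim=-1)  # (4, 10)
losses = CEloss(preds_stacked, labels_stacked)            # (4, 10): one loss per sample per head
print(losses.shape)  # torch.Size([4, 10])
```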
From here, I considered two options:

- Option 1: per sample, only backpropagate the best (lowest) loss across all heads;
- Option 2: per head, only backpropagate that head's loss through that head.
For option 1, I simply thought to loop over the batch, select the best loss per sample, and backpropagate it:
```python
for sample in range(losses.size(0)):  # loop over the batch dimension
    best_loss = losses[sample].min()  # lowest loss across the heads for this sample
    best_loss.backward(retain_graph=True)
    optimizer.step()
```
However, it apparently is not as simple as this. I could of course use a batch size of 1, but that is not very efficient. Maybe someone can offer a solution?
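One direction I have been considering, though I am not sure it achieves the specialization I want: vectorize the selection by taking each sample's minimum loss across heads and backpropagating a single scalar, which keeps normal batching (a minimal sketch, with `losses` and `optimizer` as above):

```python
# Sketch: backpropagate only each sample's best (lowest) head loss,
# vectorized over the batch instead of looping per sample.
best_losses, best_heads = losses.min(dim=-1)  # losses: (batch_size, n_heads)
loss = best_losses.mean()                     # one scalar, one backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
```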
My approach for option 2 was to rebuild the optimizer during training, so that each step only updates one head (plus the shared fc1):
```python
# Stack predictions and labels
preds_stacked = torch.stack(preds, dim=-1)
labels_stacked = torch.stack([labels for i in range(len(model.classifier_heads))], dim=-1)

# Calculate the loss per sample per head
losses = CEloss(preds_stacked, labels_stacked)

# Mean loss per head
losses_per_head = torch.mean(losses, dim=0)

# Backpropagate each head's loss through that head (and the shared fc1) only
for header in range(len(model.classifier_heads)):
    # Rebuild the optimizer around the parameters of this head
    params = list(model.fc1.parameters()) + list(model.classifier_heads[header].parameters())
    optimizer = torch.optim.Adam(params=params, lr=3e-4)
    optimizer.zero_grad()
    loss = losses_per_head[header]
    loss.backward(retain_graph=True)
    optimizer.step()
```
However, I get the error that one of the variables needed for gradient computation has been modified by an in-place operation. I could create multiple models, but I want to do this within a single model.
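My current guess is that calling optimizer.step() between the retained backward passes modifies fc1's parameters in place, invalidating the retained graph. One workaround I am considering (a sketch, assuming a single optimizer over all model parameters): since each head's loss only depends on its own parameters plus the shared trunk, a single backward pass on the summed per-head losses should already route each head the gradient of its own loss.

```python
# Sketch: do all the backpropagation before any parameter update.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

optimizer.zero_grad()
losses_per_head.sum().backward()  # each head only receives its own loss's gradient;
                                  # fc1 accumulates the sum across heads
optimizer.step()
```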
I hope someone can offer some suggestions on how to improve this.