Looping over a submodel?

Okay, what I’m trying to do is use an existing PyTorch model (BERT) in my own custom network.

For context, BERT works on input sequences of at most 512 tokens.

As my input data is much longer than that, I’ve split each example into chunks of length 512, with at most 4 chunks per example. So, in my custom network, with a batch size of 32, the input shape would be (32, 4, 512).

So I’ve included a pretrained BERT model as the first layer of my network, and in the forward function I loop over each of the four 512-length chunks, pass them into BERT, get the output back, and stack the outputs together. The output of a single BERT pass is a vector of size 768, so after the loop I have a batch of shape (32, 4, 768).

What I’m struggling with is that my model simply isn’t converging. Could it be that the backward pass isn’t propagating through the BERT loop properly?

Could you check whether your BERT model gets valid gradients after the loss.backward() call?
You can print the .grad attributes of some layers or of all parameters:

...
loss.backward()
print(model.bert_model.some_layer.weight.grad)
# or
for name, param in model.named_parameters():
    print(name, param.grad)

If you don’t see any gradients, this would mean that the computation graph was detached somewhere.
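As a rough illustration of what a detached graph looks like in practice, here is a minimal sketch. The two `nn.Linear` layers are hypothetical stand-ins (`encoder` for BERT, `head` for the classifier), and `grad_report` is a helper name made up for this example; it reports whether each encoder parameter received a non-zero gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical two-stage model: `encoder` stands in for BERT, `head` for the classifier.
encoder = nn.Linear(8, 4)
head = nn.Linear(4, 2)

x = torch.randn(3, 8)
target = torch.tensor([0, 1, 0])

def grad_report(detach):
    """Run one forward/backward pass and report which encoder parameters got gradients."""
    encoder.zero_grad()
    head.zero_grad()
    features = encoder(x)
    if detach:
        features = features.detach()  # cuts the graph between encoder and head
    loss = nn.functional.cross_entropy(head(features), target)
    loss.backward()
    return {
        name: p.grad is not None and p.grad.abs().sum().item() > 0
        for name, p in encoder.named_parameters()
    }

connected = grad_report(detach=False)
detached = grad_report(detach=True)
print("connected:", connected)
print("detached: ", detached)
```

With the graph intact, both encoder parameters report a gradient; with the `detach()` call in place, neither does.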

I’ve looked at the rad.bert.bert.encoder.layer[10].intermediate.dense.weight.grad values and they do change.

I’m really unsure why else the model isn’t converging.

If it’s any help, here’s the rest of it:

import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss

class ClassificationBert(nn.Module):

    # As input, all our data is shaped (documents, segments);
    # that is, we have multiple segments per document.
    def __init__(self, bert, labels):
        super(ClassificationBert, self).__init__()

        self.num_labels = labels

        self.bert = bert

        self.linear = nn.Linear(3072, self.num_labels)

        self.sigmoid = nn.Sigmoid()

    def forward(self, text, labels):
        # We loop over all the sequences to get the BERT representations
        pooled_layer_output = []
        for i in range(len(text)):
            bert_outputs = []
            for j in range(len(text[i])):
                bert_out = self.bert(text[i][j].unsqueeze(0))

                bert_outputs.append(bert_out)

            bs = torch.stack(bert_outputs).view(-1)
            pooled_layer_output.append(bs)

        # Stack the flattened per-document vectors into a single (documents, 3072) tensor.
        pooled_layer_output = torch.stack(pooled_layer_output)

        logits = self.linear(pooled_layer_output)  # We only use the output of the last hidden layer.

        logits = self.sigmoid(logits)
        outputs = (logits,)  # add hidden states and attention if they are here

        if labels is not None:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

        return outputs

So you can see that I pass data into bert in a loop.
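For comparison, the nested loop can usually be replaced by folding the segment dimension into the batch dimension. The sketch below is not the model from this thread: `ToyEncoder` is a hypothetical stand-in that maps token ids of shape (N, L) to one pooled vector of shape (N, 768), and the vocabulary size is made up. It only checks that the looped and batched versions produce the same (documents, 3072) tensor.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Hypothetical stand-in for BERT: token ids (N, L) -> pooled vector (N, H)."""
    def __init__(self, vocab_size=100, hidden=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, token_ids):
        return torch.tanh(self.proj(self.embed(token_ids).mean(dim=1)))

torch.manual_seed(0)
encoder = ToyEncoder().eval()
text = torch.randint(0, 100, (2, 4, 512))  # (documents, segments, tokens)

# Loop version, as in the post: one segment per encoder call.
looped = torch.stack([
    torch.stack([encoder(seg.unsqueeze(0)) for seg in doc]).view(-1)
    for doc in text
])

# Batched version: fold segments into the batch dimension, then unfold per document.
docs, segs, toks = text.shape
batched = encoder(text.view(docs * segs, toks)).view(docs, -1)

print(looped.shape, batched.shape)
print(torch.allclose(looped, batched, atol=1e-5))
```

With a real BERT the batched call processes all segments at once, so it needs more memory per step than the loop; on limited hardware the loop may still be the practical choice.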

The loop shouldn’t be the problem here.
Could you try to overfit your model on a small subset of your data, e.g. just 10 samples?
If your model cannot overfit these samples nearly perfectly, some other bugs might be in the code.
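A minimal sketch of such an overfitting check, using made-up data and a small hypothetical classifier rather than the BERT model from this thread: ten random samples with random binary labels, trained for a few hundred steps on that same tiny set.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: 10 random samples with random binary labels.
inputs = torch.randn(10, 16)
labels = torch.randint(0, 2, (10,))

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()  # receives raw logits from the last Linear layer

# Train repeatedly on the same 10 samples; a healthy setup should memorize them.
for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()

final_loss = loss.item()
train_accuracy = (model(inputs).argmax(dim=1) == labels).float().mean().item()
print(final_loss, train_accuracy)
```

If the training loss on such a small subset does not approach zero, the problem is more likely in the model, loss, or data pipeline than in the amount of data.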

Yeah, it’s unable to fit the data. Just an idea, but as I call BERT in a loop 4 times during the forward pass, could the gradient being passed to BERT be calculated four times during the backward pass, so that each backward pass amplifies the gradient 4x?
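To make the question concrete, here is a small sketch of what autograd does when one module is called several times in a forward pass. The `nn.Linear` layer is a hypothetical stand-in for the shared BERT encoder. The check compares the gradient from one looped forward pass with the sum of the gradients obtained from each chunk on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
shared = nn.Linear(5, 3)    # hypothetical stand-in for the shared BERT encoder
chunks = torch.randn(4, 5)  # four chunks of one document

# Gradient when all four chunks go through the shared module in one forward pass.
shared.zero_grad()
torch.stack([shared(c) for c in chunks]).sum().backward()
looped_grad = shared.weight.grad.clone()

# Sum of the gradients obtained by backpropagating each chunk separately.
summed_grad = torch.zeros_like(shared.weight)
for c in chunks:
    shared.zero_grad()
    shared(c).sum().backward()
    summed_grad += shared.weight.grad

print(torch.allclose(looped_grad, summed_grad))
```

The two agree: the shared weights receive the sum of the four per-chunk contributions, which is the ordinary gradient of a loss that depends on all four outputs. Whether the resulting gradient scale matters for convergence in a given setup is a separate question, for example one of learning rate.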

That could be one reason.
Also, I’m a bit sceptical about these lines of code:

bs = torch.stack(bert_outputs).view(-1)
...
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
outputs = (loss,) + outputs

Could you print the shape of bs, pooled_layer_output and logits?
Also, what is the last line of code doing?

bs.shape: torch.Size([3072])

pooled_layer_shape: torch.Size([2, 3072])

logits shape: torch.Size([2, 2])

The last line of code calculates the loss, and then returns it as a tuple (loss, logits)

Thanks for the info.
Your batch size should be 32 as described in the first post.
Reshaping the activations to [2, 3072] seems to be wrong or am I missing something?

Apologies, that was just an example. I actually pass in batches of size two, as that’s pretty much the limit of the computing power available, and emulate the batch size of 32 instead. So it IS a batch of 2.
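The post does not say how the larger batch is emulated. One common way is gradient accumulation, sketched below purely as an assumption: a hypothetical `nn.Linear` model and random data, with 16 micro-batches of 2 compared against a single batch of 32.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(6, 2)  # hypothetical stand-in for the full network
inputs = torch.randn(32, 6)
labels = torch.randint(0, 2, (32,))
loss_fn = nn.CrossEntropyLoss()

# Reference: gradient from a single batch of 32.
model.zero_grad()
loss_fn(model(inputs), labels).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation: 16 micro-batches of 2, each loss scaled by the number of micro-batches.
micro_batch_size = 2
num_micro_batches = inputs.shape[0] // micro_batch_size
model.zero_grad()
for x, y in zip(inputs.split(micro_batch_size), labels.split(micro_batch_size)):
    (loss_fn(model(x), y) / num_micro_batches).backward()
accumulated_grad = model.weight.grad.clone()

print(torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6))
```

The scaling by `num_micro_batches` is what keeps the accumulated gradient equal to the full-batch one; without it the gradient would be 16 times larger.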

Thanks for the information and sorry for missing the probably obvious bug.
nn.CrossEntropyLoss expects raw logits, so could you remove the logits = self.sigmoid(logits) and try to overfit the small data sample again?
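A small sketch of why this matters, using made-up logits for a two-class problem. nn.CrossEntropyLoss applies log-softmax internally, so if its inputs have already been squashed into (0, 1) by a sigmoid, the difference between the two class scores can never exceed 1, and the loss cannot fall below log(1 + e^-1), roughly 0.313, however confident the model is.

```python
import math
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
labels = torch.tensor([0, 1])

# Very confident, correct raw logits for a two-class problem (made-up values).
raw_logits = torch.tensor([[10.0, -10.0], [-10.0, 10.0]])

loss_raw = loss_fn(raw_logits, labels).item()
loss_squashed = loss_fn(torch.sigmoid(raw_logits), labels).item()

# With inputs confined to (0, 1), the two-class loss cannot drop below log(1 + e^-1).
lower_bound = math.log(1 + math.exp(-1))

print(loss_raw, loss_squashed, lower_bound)
```

The raw logits give a loss near zero, while the same predictions passed through a sigmoid stay just above the bound. For reference, ln 2 ≈ 0.693 is the loss of a two-class model that assigns equal scores to both classes.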

Nothing seems to happen: the loss just bounces around the 0.69 mark and I get an AUROC of roughly 0.44 on the training data.

I’m not quite sure of all the steps I took, but I removed the sigmoid function and replaced the loss calculation:

loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

With this:

loss_fct = CrossEntropyLoss()
loss = loss_fct(logits, labels.view(-1))

Now it has overfitted extremely well, all the metrics it outputs make sense, and I get the output below:

{'train': {'loss': 0.04647320210933685, 'accuracy': 0.7676767676767676, 'roc': 1.0}, 'test': {'loss': 0.7614774504303932, 'accuracy': 0.5555555555555556, 'roc': 0.6168539325842697}}

So thank you very much for your advice!
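As a footnote on the two loss calls compared in this thread: when the logits already have shape (batch, num_labels), as with the [2, 2] shape reported earlier, logits.view(-1, self.num_labels) leaves them unchanged, so both calls should return the same value. The sketch below checks this with random logits and hypothetical labels, which hints that removing the sigmoid was the change that mattered, although the thread does not confirm this.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_labels = 2
logits = torch.randn(2, num_labels)  # same shape as reported in the thread
labels = torch.tensor([[1], [0]])    # hypothetical labels with a trailing dimension

loss_fct = nn.CrossEntropyLoss()
loss_with_view = loss_fct(logits.view(-1, num_labels), labels.view(-1))
loss_without_view = loss_fct(logits, labels.view(-1))

print(loss_with_view.item(), loss_without_view.item())
```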