Delete this post please

I do not see a zero_grad call on the optimizer; is that also part of the training loop? Additionally, some more details, such as whether the accuracy is plateauing or never changing at all, would help with debugging.

Yes, earlier in the training loop there is a zero_grad call on the optimizer. The accuracy reaches a certain threshold and plateaus there; sometimes it dips below the threshold, but it never goes above it.

What kind of learning rate schedule is being used, and is it possible that the learning rate has decayed to a very small value?

I’m keeping the learning rate constant and passing it in as a hyperparameter.
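
For reference, a quick way to confirm which value the optimizer is actually using is to print the learning rate stored in its parameter groups. A minimal sketch, assuming a standard torch.optim optimizer named optimizer:

# sanity check: print the learning rate each epoch to confirm it is not decaying
for param_group in optimizer.param_groups:
    print("current lr:", param_group["lr"])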

Can you provide enough of your training loop to reproduce the issue? You can pass in a “dummy dataset” defined before the loop to try overfitting, for example:

dummy_inputs = torch.rand(input_dims)
dummy_targets = torch.empty(target_dims).random_(2)

Here is the whole training loop:

The accuracy for the other two tasks isn't plateauing at the moment, only for the para task.

for epoch in range(args.epochs):
        model.train()
        train_loss = torch.zeros(3)
        num_batches = 0
        for sst_batch, sts_batch, para_batch in zip(
                                                    tqdm(
                                                        sst_train_dataloader, desc=f"train-sst-{epoch}", disable=TQDM_DISABLE
                                                    ),
                                                    tqdm(
                                                        sts_train_dataloader, desc=f"train-sts-{epoch}", disable=TQDM_DISABLE
                                                    ),
                                                    tqdm(
                                                        para_train_dataloader, desc=f"train-para-{epoch}", disable=TQDM_DISABLE
                                                    )):
            # sentiment
            sst_b_ids, sst_b_mask, sst_b_labels = (sst_batch['token_ids'],
                                                   sst_batch['attention_mask'], sst_batch['labels'])

            sst_b_ids = sst_b_ids.to(device)
            sst_b_mask = sst_b_mask.to(device)
            sst_b_labels = sst_b_labels.to(device)

            optimizer.zero_grad()
            sst_logits = model.predict_sentiment(sst_b_ids, sst_b_mask)
            sst_loss = F.cross_entropy(sst_logits, sst_b_labels.view(-1), reduction='sum') / sst_b_labels.size()[0]
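            # note: summing the per-example losses and dividing by the batch size is equivalent to reduction='mean' here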

            train_loss[0] += sst_loss.item()
            sst_loss.backward()
            optimizer.step()

            ### para loss
            para_b_ids, para_b_mask, para_b_ids_2, para_b_mask_2, para_b_labels = (para_batch['token_ids_1'],
                                                                                   para_batch['attention_mask_1'],
                                                                                   para_batch['token_ids_2'],
                                                                                   para_batch['attention_mask_2'],
                                                                                   para_batch['labels'])

            para_b_ids = para_b_ids.to(device)
            para_b_mask = para_b_mask.to(device)
            para_b_ids_2 = para_b_ids_2.to(device)
            para_b_mask_2 = para_b_mask_2.to(device)
            para_b_labels = para_b_labels.to(device)

            optimizer.zero_grad()
            para_logit = model.predict_paraphrase(para_b_ids, para_b_mask, para_b_ids_2, para_b_mask_2)
            para_loss = F.binary_cross_entropy(torch.Tensor.float(torch.reshape(para_logit, (-1,))), para_b_labels.float()) / para_b_labels.size()[0]
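            # note: F.binary_cross_entropy already averages over the batch by default (reduction='mean'),
            # so the extra division by the batch size shrinks the para loss by another factor of the batch size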

            train_loss[1] += para_loss.item()
            para_loss.backward()
            optimizer.step()
            # similarity
            sts_b_ids, sts_b_mask, sts_b_ids_2, sts_b_mask_2, sts_b_labels = (sts_batch['token_ids_1'],
                                                                              sts_batch['attention_mask_1'],
                                                                              sts_batch['token_ids_2'],
                                                                              sts_batch['attention_mask_2'],
                                                                              sts_batch['labels'])
            sts_b_ids = sts_b_ids.to(device)
            sts_b_mask = sts_b_mask.to(device)
            sts_b_ids_2 = sts_b_ids_2.to(device)
            sts_b_mask_2 = sts_b_mask_2.to(device)
            sts_b_labels = sts_b_labels.to(device)

            optimizer.zero_grad()
            sts_logit = model.predict_similarity(sts_b_ids, sts_b_mask, sts_b_ids_2, sts_b_mask_2)
            sts_loss = F.mse_loss(sts_logit.float(), sts_b_labels.float(), reduction="mean")
            train_loss[2] += sts_loss.item()
            sts_loss.backward()
            optimizer.step()
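
The accuracy calculation for the paraphrase task is not shown in the thread; a minimal sketch of how it might be computed, assuming predict_paraphrase returns probabilities in [0, 1] (which F.binary_cross_entropy requires) and a 0.5 decision threshold:

# hypothetical paraphrase-accuracy check (not from the original post)
para_preds = (para_logit.view(-1) > 0.5).float()
para_accuracy = (para_preds == para_b_labels.float()).float().mean()
print("para accuracy:", para_accuracy.item())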

Here is an example of a fully reproducible code snippet:

# include imports, especially if using any external libraries
import torch
import torch.nn as nn

#define the model
class CNN(nn.Module):
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, hidden_dim, kernel_size=(3,3), bias=False),
            nn.BatchNorm2d(hidden_dim),
            nn.ReLU(),
            nn.MaxPool2d(2,2))
        self.avgpool = nn.AdaptiveAvgPool2d(2)
        self.fc_out = nn.Linear(4 * hidden_dim, num_classes)

    def forward(self, x):
        x = self.layers(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)  # flatten to (batch, 4 * hidden_dim)
        return self.fc_out(x)

model = CNN(32, 10)  # keep the model small so it runs quickly

#define the loss and optimizer functions
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

#create dummy data
dummy_input = torch.rand((4, 3, 224, 224))
dummy_targets = torch.empty(4, 10).random_(2)

#same train loop except use the dummy inputs
while True:
    optimizer.zero_grad()
    output = model(dummy_input)

    loss = criterion(output, dummy_targets)

    loss.backward()
    optimizer.step()

    # show how you're calculating accuracy
    accuracy = (1 - torch.abs(torch.round(torch.sigmoid(output.detach())) - dummy_targets)).mean()
    print("Loss", loss.item(), "Accuracy", accuracy.item())

Sorry if I’m being dense, but I would replace the CNN in your example with my model and then test from there?

Correct. Include your train loop and any other relevant details as well, and just make sure it runs without any other errors.