Delete this post please

I do not see a zero_grad call on the optimizer; is that also part of the training loop? Additionally, some more details, such as whether the accuracy is plateauing or never changing at all, would help with debugging.

Yes, earlier in the training loop there is a zero_grad call on the optimizer. The accuracy reaches a certain threshold and plateaus there; sometimes it dips below the threshold, but it never goes above it.

What kind of learning rate schedule is being used, and is it possible that the learning rate has decayed to a very small value?

I’m keeping the learning rate constant and passing it in as a hyperparameter.
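
For reference, a quick way to confirm which value the optimizer is actually using is to print the learning rate stored in its parameter groups. A minimal sketch, assuming a standard torch.optim optimizer named optimizer:

# sanity check: print the learning rate each epoch to confirm it is not decaying
for param_group in optimizer.param_groups:
    print("current lr:", param_group["lr"])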

Can you provide enough of your training loop to reproduce the issue? You can pass in a “dummy dataset” defined before the loop to try overfitting, for example:

dummy_inputs = torch.rand(input_dims)
dummy_targets = torch.empty(target_dims).random_(2)

Here is the whole training loop:

The accuracy for the other two tasks isn't plateauing at the moment, only for the para task.

for epoch in range(args.epochs):
        model.train()
        train_loss = torch.zeros(3)
        num_batches = 0
        for sst_batch, sts_batch, para_batch in zip(
                                                    tqdm(
                                                        sst_train_dataloader, desc=f"train-sst-{epoch}", disable=TQDM_DISABLE
                                                    ),
                                                    tqdm(
                                                        sts_train_dataloader, desc=f"train-sts-{epoch}", disable=TQDM_DISABLE
                                                    ),
                                                    tqdm(
                                                        para_train_dataloader, desc=f"train-para-{epoch}", disable=TQDM_DISABLE
                                                    )):
            # sentiment
            sst_b_ids, sst_b_mask, sst_b_labels = (sst_batch['token_ids'],
                                                   sst_batch['attention_mask'], sst_batch['labels'])

            sst_b_ids = sst_b_ids.to(device)
            sst_b_mask = sst_b_mask.to(device)
            sst_b_labels = sst_b_labels.to(device)

            optimizer.zero_grad()
            sst_logits = model.predict_sentiment(sst_b_ids, sst_b_mask)
            sst_loss = F.cross_entropy(sst_logits, sst_b_labels.view(-1), reduction='sum') / sst_b_labels.size()[0]
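            # note: summing the per-example losses and dividing by the batch size is equivalent to reduction='mean' here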

            train_loss[0] += sst_loss.item()
            sst_loss.backward()
            optimizer.step()

            ### para loss
            para_b_ids, para_b_mask, para_b_ids_2, para_b_mask_2, para_b_labels = (para_batch['token_ids_1'],
                                                                                   para_batch['attention_mask_1'],
                                                                                   para_batch['token_ids_2'],
                                                                                   para_batch['attention_mask_2'],
                                                                                   para_batch['labels'])

            para_b_ids = para_b_ids.to(device)
            para_b_mask = para_b_mask.to(device)
            para_b_ids_2 = para_b_ids_2.to(device)
            para_b_mask_2 = para_b_mask_2.to(device)
            para_b_labels = para_b_labels.to(device)

            optimizer.zero_grad()
            para_logit = model.predict_paraphrase(para_b_ids, para_b_mask, para_b_ids_2, para_b_mask_2)
            para_loss = F.binary_cross_entropy(torch.Tensor.float(torch.reshape(para_logit, (-1,))), para_b_labels.float()) / para_b_labels.size()[0]
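            # note: F.binary_cross_entropy already averages over the batch by default (reduction='mean'),
            # so the extra division by the batch size shrinks the para loss by another factor of the batch size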

            train_loss[1] += para_loss.item()
            para_loss.backward()
            optimizer.step()
            # similarity
            sts_b_ids, sts_b_mask, sts_b_ids_2, sts_b_mask_2, sts_b_labels = (sts_batch['token_ids_1'],
                                                                              sts_batch['attention_mask_1'],
                                                                              sts_batch['token_ids_2'],
                                                                              sts_batch['attention_mask_2'],
                                                                              sts_batch['labels'])
            sts_b_ids = sts_b_ids.to(device)
            sts_b_mask = sts_b_mask.to(device)
            sts_b_ids_2 = sts_b_ids_2.to(device)
            sts_b_mask_2 = sts_b_mask_2.to(device)
            sts_b_labels = sts_b_labels.to(device)

            optimizer.zero_grad()
            sts_logit = model.predict_similarity(sts_b_ids, sts_b_mask, sts_b_ids_2, sts_b_mask_2)
            sts_loss = F.mse_loss(sts_logit.float(), sts_b_labels.float(), reduction="mean")
            train_loss[2] += sts_loss.item()
            sts_loss.backward()
            optimizer.step()
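
The accuracy calculation for the paraphrase task is not shown in the thread; a minimal sketch of how it might be computed, assuming predict_paraphrase returns probabilities in [0, 1] (which F.binary_cross_entropy requires) and a 0.5 decision threshold:

# hypothetical paraphrase-accuracy check (not from the original post)
para_preds = (para_logit.view(-1) > 0.5).float()
para_accuracy = (para_preds == para_b_labels.float()).float().mean()
print("para accuracy:", para_accuracy.item())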

Here is an example of a fully reproducible code snippet:

# include imports, especially if using any external libraries
import torch
import torch.nn as nn

#define the model
class CNN(nn.Module):
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, hidden_dim, kernel_size=(3,3), bias=False),
            nn.BatchNorm2d(hidden_dim),
            nn.ReLU(),
            nn.MaxPool2d(2,2))
        self.avgpool = nn.AdaptiveAvgPool2d(2)
        self.fc_out = nn.Linear(4 * hidden_dim, num_classes)

    def forward(self, x):
        x = self.layers(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)  # flatten to (batch, 4 * hidden_dim)
        return self.fc_out(x)

model = CNN(32, 10)  # keep the model small so it runs quickly

#define the loss and optimizer functions
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

#create dummy data
dummy_input = torch.rand((4, 3, 224, 224))
dummy_targets = torch.empty(4, 10).random_(2)

#same train loop except use the dummy inputs
while True:
    optimizer.zero_grad()
    output = model(dummy_input)

    loss = criterion(output, dummy_targets)

    loss.backward()
    optimizer.step()

    # show how you're calculating accuracy
    accuracy = (1 - torch.abs(torch.round(torch.sigmoid(output.detach())) - dummy_targets)).mean()
    print("Loss", loss.item(), "Accuracy", accuracy.item())

Sorry if I’m being dense, but I would replace the CNN in your example with my model and then test from there?

Correct. Include your train loop and any other relevant details as well, and just make sure it runs without any other errors.