Model predicts only one class

Tomash · November 10, 2020, 12:29am

Hi. Could somebody help me find the mistake in my training and validation loops? Everything looks fine to me, but from the second epoch onward (the first epoch looks normal), every image in the validation dataset is predicted as the same class. I am classifying into one of four classes, and everything is predicted as, for example, 1 or 4.
The model is based on RCNN.

My training function:

def train(train_loader, model, optimizer, epoch, device):
    model.train()
    loss_monitor = AverageMeter()
    with tqdm(train_loader) as _tqdm:
        for x, y in _tqdm:
            x = x.to(device)
            y = y.to(device)

            outputs = model(x, y)

            loss = outputs["loss_classifier"]

            optimizer.zero_grad()
            (outputs["loss_classifier"]).backward()
            optimizer.step()

    return loss  # I know it's unnecessary

Validation function:

def validate(val_loader, model, epoch, device):
    model.eval()
    preds = []
    gt = []
    with torch.no_grad():
        with tqdm(val_loader) as _tqdm:
            for x, y in _tqdm:
                x = x.to(device)
                y = y.to(device)
                gt.append(y["class"].cpu().numpy())

                outputs = model(x, y)

                for output in outputs:
                    pred = F.softmax(output["age"], dim=-1).cpu().numpy()
                    pred = (pred * np.arange(0, pred.size)).sum(axis=-1)
                    preds.append(np.array([pred]))
                _tqdm.set_postfix(OrderedDict(stage="val", epoch=epoch))

    mae = calculate_mae(gt, preds)  # computes MAE (classes are not independent: class 1 is closer to 2 than to 3)
    f1 = calculate_f1(gt, preds)  # computes F1
    return mae, f1

Main loop:

model = PornRCNN.create_resnet_50()

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

model = model.to(device)
model.set_age_loss_fn(loss_classifier)

scheduler = StepLR(
    optimizer, step_size=0.0001, gamma=0.2, last_epoch=start_epoch - 1,
)

best_val_f1 = 0

for epoch in range(start_epoch, num_epoch):
    train_loss = train(train_loader, model, optimizer, epoch, device)
    mae, f1 = validate(val_loader, model, epoch, device)

    if f1 > best_val_f1:
        model_state_dict = model.state_dict()
    
        best_val_f1 = f1
 
    scheduler.step()

Any ideas why it behaves this way? Do you have any tips on how to do it better?
I should add that the training loss decreases normally, so that’s not the problem.
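One detail in the main loop above that may be worth checking: the PyTorch documentation describes the `step_size` argument of `StepLR` as an integer number of epochs between decays, whereas the code passes `step_size=0.0001`. The pure-Python sketch below only illustrates the documented schedule, `lr = base_lr * gamma ** (epoch // step_size)`. The helper name `step_lr` and the value `step_size=10` are assumptions made for illustration and do not come from the original post.

```python
def step_lr(base_lr, gamma, step_size, epoch):
    """Closed-form step decay: multiply by gamma once every step_size epochs.

    Hypothetical helper for illustration; step_size is an integer epoch count.
    """
    if not isinstance(step_size, int) or step_size < 1:
        raise ValueError("step_size must be a positive integer number of epochs")
    return base_lr * gamma ** (epoch // step_size)


# Values from the post (lr=0.0001, gamma=0.2) with an assumed step_size of 10.
schedule = [step_lr(1e-4, 0.2, 10, epoch) for epoch in (0, 9, 10, 20)]
```

Printing the optimizer's learning rate once per epoch is a cheap way to confirm whether the real scheduler follows a schedule like this one.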

Abhilash_Srivastava · November 10, 2020, 5:29am

Two things:

Tomash:

                pred = F.softmax(output["age"], dim=-1).cpu().numpy()
                pred = (pred * np.arange(0, pred.size)).sum(axis=-1)
                preds.append(np.array([pred]))

What is happening in this block?
Have you trained your network for a sufficient number of epochs? Calculate the training f1 and mae to ensure the model is training as expected.

Tomash · November 10, 2020, 9:08am

First of all, thanks for your reply.

This block just converts the prediction into exactly the same format as gt.
I trained my net for as many as 20-30 epochs, but only the predictions from the first epoch looked more or less normal (they weren’t good, but they were well distributed across the classes). From the second epoch on, the results were always the same: f1 didn’t change, and the model predicted a single class for every element. I will check f1 and mae on the training set soon. Do you have any other suggestions?
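To make the quoted block concrete: it applies a softmax to the logits and then takes the probability-weighted average of the class indices, so the result is an expected class index (a float) rather than an argmax label. The sketch below reproduces that arithmetic in plain Python for a single sample. The function names and the example logits are assumptions for illustration, not part of the original code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_class(logits):
    """Probability-weighted mean of the class indices 0..C-1."""
    probs = softmax(logits)
    return sum(i * p for i, p in enumerate(probs))


def argmax_class(logits):
    """Index of the largest logit, i.e. the hard class prediction."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical logits for one image and four classes.
logits = [2.0, 0.0, 0.0, 2.0]
soft = expected_class(logits)  # lies between the two likely classes
hard = argmax_class(logits)    # picks one of them
```

With these logits the expected index is 1.5, halfway between classes 0 and 3, even though classes 1 and 2 are the least likely. Note also that `pred.size` in the original is the total number of elements in the array, so the weights only line up with the class indices when `pred` holds a single sample.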

Tomash · November 12, 2020, 9:35am

I actually don’t know how to check mae/f1 during training, because the outputs in RCNN training mode don’t include predictions. Does anyone know how to do it?

Abhilash_Srivastava · November 12, 2020, 8:16pm

If I understand it correctly, simply passing train_loader in place of val_loader to validate should give you the mae and f1 for the training dataset.
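The `calculate_mae` and `calculate_f1` helpers are not shown in the thread. As a rough, assumed stand-in, the sketch below computes MAE and macro-averaged F1 from two flat lists of integer labels, so the same numbers can be produced for whichever loader is evaluated. The function names and the sample labels are hypothetical.

```python
def mean_absolute_error(gt, preds):
    """Average absolute difference between labels and predictions."""
    return sum(abs(g - p) for g, p in zip(gt, preds)) / len(gt)


def macro_f1(gt, preds, num_classes):
    """Unweighted mean of per-class F1; a class that is never hit scores 0."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for g, p in zip(gt, preds) if g == c and p == c)
        fp = sum(1 for g, p in zip(gt, preds) if g != c and p == c)
        fn = sum(1 for g, p in zip(gt, preds) if g == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Hypothetical labels for eight samples, and a model that always predicts class 1.
gt = [0, 1, 2, 3, 0, 1, 2, 3]
collapsed = [1] * len(gt)
mae = mean_absolute_error(gt, collapsed)
f1 = macro_f1(gt, collapsed, num_classes=4)
```

On balanced four-class data like this, a model that always predicts one class gets a macro F1 of 0.1, so an F1 that stays fixed near that level on both the training and validation sets is consistent with collapsed predictions.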

Tomash · November 12, 2020, 8:25pm

The problem is the same. Maybe you can see something wrong in the code?

Abhilash_Srivastava · November 12, 2020, 11:02pm

If you’re facing the same issue with the training data, it implies your model is not learning.
Share the full latest code (model, training, validation logic, etc.).
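Before sharing the full code, it can help to confirm exactly when the predictions collapse. The sketch below counts how often each class is predicted and flags an epoch in which a single class dominates. The function names, the 0.99 threshold, and the sample predictions are assumptions for illustration.

```python
from collections import Counter


def prediction_histogram(preds, num_classes):
    """Count how often each class index is predicted."""
    counts = Counter(preds)
    return [counts.get(c, 0) for c in range(num_classes)]


def is_collapsed(preds, num_classes, threshold=0.99):
    """True if one class accounts for at least `threshold` of predictions."""
    hist = prediction_histogram(preds, num_classes)
    return max(hist) / len(preds) >= threshold


# Hypothetical hard predictions from two epochs.
epoch_1 = [0, 1, 2, 3, 1, 2, 0, 3]
epoch_2 = [1, 1, 1, 1, 1, 1, 1, 1]
```

Logging this histogram after every epoch, for both the training and validation sets, shows whether the collapse happens in the model itself or only in the post-processing of its outputs.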