Model predicts only one class

Hi. Could somebody help me find mistakes in my training and validation loops? Everything looks fine to me, but from the second epoch on (the results after the first epoch look normal), every image in the validation dataset is predicted as the same class during validation. For example, I am classifying into one of four classes, and everything is predicted as 1 or, e.g., 4.
It's based on an RCNN model.

My training function:

from tqdm import tqdm

def train(train_loader, model, optimizer, epoch, device):
    model.train()
    loss_monitor = AverageMeter()
    with tqdm(train_loader) as _tqdm:
        for x, y in _tqdm:
            x = x.to(device)
            y = y.to(device)

            # In training mode the model returns a dict of losses.
            outputs = model(x, y)

            loss = outputs["loss_classifier"]

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return loss  # I know it's unnecessary

Validation function:

from collections import OrderedDict

import numpy as np
import torch
import torch.nn.functional as F
from tqdm import tqdm

def validate(val_loader, model, epoch, device):
    model.eval()
    preds = []
    gt = []
    with torch.no_grad():
        with tqdm(val_loader) as _tqdm:
            for x, y in _tqdm:
                x = x.to(device)
                y = y.to(device)
                gt.append(y["class"].cpu().numpy())

                outputs = model(x, y)

                for output in outputs:
                    # Softmax over the class scores, then the probability-weighted class index.
                    pred = F.softmax(output["age"], dim=-1).cpu().numpy()
                    pred = (pred * np.arange(0, pred.size)).sum(axis=-1)
                    preds.append(np.array([pred]))
                _tqdm.set_postfix(OrderedDict(stage="val", epoch=epoch),)

    mae = calculate_mae(gt, preds)  # function to calculate mae (the classes are not independent: class 1 is closer to 2 than to 3)
    f1 = calculate_f1(gt, preds)  # function to calculate f1
    return mae, f1

Main loop:

model = PornRCNN.create_resnet_50()

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

model = model.to(device)
model.set_age_loss_fn(loss_classifier)

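# Note: StepLR documents step_size as an integer number of epochs between decays;
# 0.0001 looks like a learning rate rather than a step count.
scheduler = StepLR(
    optimizer, step_size=0.0001, gamma=0.2, last_epoch=start_epoch - 1,
)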

best_val_f1 = 0

for epoch in range(start_epoch, num_epoch):
    train_loss = train(train_loader, model, optimizer, epoch, device)
    mae, f1 = validate(val_loader, model, epoch, device)

    if f1 > best_val_f1:
        model_state_dict = model.state_dict()
    
        best_val_f1 = f1
 
    scheduler.step()

Any ideas why it behaves this way? Do you have any tips on how to do it better?
I should add that the training loss decreases normally, so that is not the problem.

Two things:

  1. What is happening in this block?
  2. Have you trained your network for a sufficient number of epochs? Calculate the training f1 and mae to ensure the model is training as expected.
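
For point 2, here is a minimal sketch of the two metrics in plain Python, assuming the predictions and the ground truth are lists of integer class labels. The helper names `mean_absolute_error` and `macro_f1` are hypothetical; they are not the `calculate_mae` / `calculate_f1` functions from the original post, whose implementations are not shown.

```python
def mean_absolute_error(gt, preds):
    """Average absolute distance between true and predicted class indices."""
    return sum(abs(g - p) for g, p in zip(gt, preds)) / len(gt)


def macro_f1(gt, preds, num_classes):
    """Unweighted mean of per-class F1; a class with no hits scores 0."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for g, p in zip(gt, preds) if g == c and p == c)
        fp = sum(1 for g, p in zip(gt, preds) if g != c and p == c)
        fn = sum(1 for g, p in zip(gt, preds) if g == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Hypothetical labels: four classes, two samples each.
gt = [0, 1, 2, 3, 0, 1, 2, 3]
collapsed = [1] * len(gt)  # a model that always predicts class 1
print(mean_absolute_error(gt, collapsed))
print(macro_f1(gt, collapsed, num_classes=4))
```

If the training-set values look like the collapsed case here and stay frozen from epoch to epoch, the problem is probably in the training itself rather than in the validation code.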

First of all, thanks for your reply.

  1. This block just converts the predictions into exactly the same format as gt.
  2. I trained my net for as many as 20-30 epochs, but only the predictions from the first epoch looked somewhat normal (they weren't good, but they had a reasonable distribution). From the second epoch on, the results were always the same (f1 didn't change, and the model always predicted a single class for every element). I will check f1 and mae on the training data soon. Do you have any other suggestions?
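
To make both points concrete, here is a minimal sketch in plain Python of what the conversion appears to do (softmax, then the probability-weighted sum of the class indices), followed by a quick count of the predicted classes to spot collapsed predictions. The function names and the sample scores are made up for illustration, and the sketch assumes one vector of class scores per image.

```python
import math
from collections import Counter


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_class(scores):
    """Probability-weighted mean of the class indices 0..len(scores)-1."""
    probs = softmax(scores)
    return sum(i * p for i, p in enumerate(probs))


def predicted_class_counts(batch_scores):
    """Round each expected class to the nearest index and count occurrences."""
    return Counter(round(expected_class(s)) for s in batch_scores)


# Hypothetical scores for three images and four classes.
batch = [
    [5.0, 0.0, 0.0, 0.0],  # confident in class 0
    [0.0, 0.0, 0.0, 5.0],  # confident in class 3
    [0.0, 0.0, 0.0, 0.0],  # uniform scores: expected class is 1.5
]
print([round(expected_class(s), 3) for s in batch])
print(predicted_class_counts(batch))
```

If the counter holds a single key for every epoch after the first, the collapse is in the model outputs and not in the metric code. Note also that the posted line uses `np.arange(0, pred.size)`; `size` is the total number of elements in the array, so it equals the number of classes only when `pred` is a single 1-D score vector.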

I actually don't know how to check mae/f1 during training, because in training mode the rcnn doesn't return predictions. Does anyone know how to do it?

If I understand it correctly, simply passing train_loader in place of val_loader should give you the mae and f1 for training dataset.

The problem is the same. Can you see anything wrong in the code?

If you’re facing the same issue with the training data, it implies your model is not learning.
Share the full latest code (model, training, validation logic etc).