Model always predicts the same class

Hi guys!
I am trying to train a classifier based on the COCO dataset. I have a few classes, and after every epoch I check F1 and MAE. The problem is that after the first iteration the output values look more or less random, and later (after 2-5 iterations) every sample is always classified as the same class (for example, when the batch has 100 elements in 100 classes, I end up predicting 10000 elements of class 0). I can’t see any mistakes in my code. My loss function works fine, so where is the problem?
As I said, it’s based on the COCO dataset, so maybe I chose the wrong learning rate or optimizer? I will be glad for any help or suggestions.

def train(train_loader, model, optimizer, epoch, device):
    loss_monitor = AverageMeter()

    lr_scheduler = None
    if epoch == 0:
        warmup_factor = 1.0 / 1000
        warmup_iters = min(1000, len(train_loader) - 1)
        lr_scheduler = utils.warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor)

    with tqdm(train_loader) as _tqdm:
        for x, y in _tqdm:
            x =
            for key, value in y.items():
                y[key] = torch.tensor(value).to(device)

            y_list = []
            for i in range(0, len(x)):
            outputs = model(x, y_list)
            cur_loss = outputs["my_loss"].detach().item()

            # measure accuracy and record loss
            sample_num = x.size(0)
            loss_monitor.update(cur_loss, sample_num)

            # compute gradient and do step

            if lr_scheduler is not None:
                lr_scheduler.step()

            _tqdm.set_postfix(
                OrderedDict(stage="train", epoch=epoch, loss=loss_monitor.avg),
            )

    return loss_monitor.avg  # unnecessary

def validate(val_loader, model, epoch, device):
    preds = []
    gt = []
    with torch.no_grad():
        with tqdm(val_loader) as _tqdm:
            for x, y in _tqdm:
                x =

                for key, value in y.items():
                    y[key] = torch.tensor(value).to(device)
                outputs = model(x)
                for output in outputs:
                    pred = np.argmax(
                    )    # just changes format
                _tqdm.set_postfix(OrderedDict(stage="val", epoch=epoch),)

    mae = calculate_mae(gt, np.array(preds))  # my own functions but works well - that's not the problem
    f1 = calculate_f1(gt, preds)
    return mae, f1

def main():
    start_epoch = 0

    device = "cuda" if torch.cuda.is_available() else "cpu"
    if device == "cuda":  # device is a plain string here, so compare against a string
        cudnn.benchmark = True

    val_dataset = LoadDataset("val")  # normal dataset maker - works fine
    train_dataset = LoadDataset("train")

    model = PornRCNN.create_resnet_50()

    optimizer = torch.optim.Adam(model.parameters(), lr=0.00001)

    model =

    # construct an optimizer
    params = [p for p in model.parameters() if p.requires_grad]

    num_epoch = 100

    for epoch in range(start_epoch, num_epoch):

        val_loader = DataLoader(
            val_dataset, batch_size=24, shuffle=False, num_workers=0
        )
        train_loader = DataLoader(
            train_dataset, batch_size=24, shuffle=False, num_workers=0
        )

        train_loss = train(train_loader, model, optimizer, epoch, device)

        mae, f1 = validate(val_loader, model, epoch, device)

        # later I just check whether mae or f1 is better than before and save the model

Could someone please take a look at it?

A few things come to my mind:

  1. Make sure your training dataset is balanced (not highly skewed towards one class).
  2. Train for more epochs. Plot the loss and check whether it actually saturates at a very low value.
  3. If step 2 works fine, make sure you’re not overfitting by using early stopping on the validation set.
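
Points 1 and 3 can be sketched roughly as below. This is only a minimal illustration, not code from the thread: `class_counts`, `imbalance_ratio` and `EarlyStopping` are hypothetical helper names, the labels are assumed to be a non-empty list of class indices, and the validation loss is assumed to be one float per epoch.

```python
from collections import Counter


def class_counts(labels):
    """Return how many samples each class label has."""
    return Counter(labels)


def imbalance_ratio(labels):
    """Ratio of the largest to the smallest class count (1.0 means balanced).

    Assumes `labels` is non-empty; classes with zero samples are not seen here.
    """
    counts = class_counts(labels)
    return max(counts.values()) / min(counts.values())


class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 2]  # hypothetical labels, for illustration only
    print(class_counts(labels))
    print(imbalance_ratio(labels))

    stopper = EarlyStopping(patience=2)
    for epoch, loss in enumerate([0.08, 0.05, 0.06, 0.07, 0.04]):
        if stopper.step(loss):
            print("stopping at epoch", epoch)
            break
```

In a loop like the one posted, `stopper.step(...)` would be called once per epoch right after validation, which assumes a validation loss is computed there in addition to MAE and F1.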

Thanks for your reply :slight_smile:

  1. I checked it and the dataset looks right (it’s well balanced, and I also added oversampling to make sure every batch contains the same number of elements from every class).
  2. I checked that too, and the loss actually looks very good. It starts at ~0.08 (after the first epoch) and stabilizes at ~0.0005 after ~80 epochs.
  3. I run validation after every epoch of training, so how could I do it any sooner?

What about my code? Maybe there is some stupid mistake in it that I didn’t catch?
Does anyone have other ideas about what I should check?
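
Since the training loss looks healthy while every sample ends up in one class, one thing that might help narrow it down is to look at the distribution of predicted classes directly, both on a training batch and in `validate()`, where the model is called as `model(x)` rather than `model(x, y_list)`. A minimal sketch, assuming the predictions are available as a flat sequence of class indices; `prediction_histogram` and `is_collapsed` are hypothetical names, and the 0.95 threshold is an arbitrary choice:

```python
from collections import Counter


def prediction_histogram(preds):
    """Count how often each class index was predicted."""
    return Counter(int(p) for p in preds)


def is_collapsed(preds, threshold=0.95):
    """True if a single class accounts for at least `threshold` of all predictions."""
    hist = prediction_histogram(preds)
    if not hist:
        return False  # nothing predicted, so nothing to judge
    top_count = hist.most_common(1)[0][1]
    return top_count / sum(hist.values()) >= threshold


if __name__ == "__main__":
    # Hypothetical predictions, for illustration only.
    print(prediction_histogram([0, 0, 0, 0]))
    print(is_collapsed([0, 0, 0, 0]))
    print(is_collapsed([0, 1, 2, 3]))
```

If the histogram is already collapsed on the training batches, the issue is more likely in the training step itself; if it only collapses in validation, the difference between the train and eval code paths would be the first place to look.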