Label prediction stuck at 0 for labels in binary classification using cross-entropy loss

Is the way I am calculating the loss or pred_labels wrong? I am getting really low accuracy values on my validation and test sets. The dataset is reasonably balanced and large enough, and I am doing binary classification.

30% of my dataset is class 0 and 70% is class 1. The dataset contains ~2000 2D tensors of varying size, ranging from 100x512 to 8000x512, with a median size of 1200x512.

import torch
import torch.nn as nn


class Classifier(nn.Module):

    def __init__(self, n_class, batch_size):
        super(Classifier, self).__init__()
        self.batch_size = batch_size
        self.transformer = VisionTransformer()  # ViT backbone defined elsewhere in my code
        # other criteria I have tried:
        #self.criterion = nn.CrossEntropyLoss(reduction='none')
        #self.criterion = nn.BCELoss(reduction='none')
        #self.criterion = nn.BCEWithLogitsLoss(reduction='none') # weighted loss
        #self.criterion = nn.BCEWithLogitsLoss() # balanced loss
        #self.criterion = nn.BCELoss()
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, X, labels):
        # X is a list of per-sample tensors; stack them into a single batch tensor
        stacked_X = torch.stack(X)
        out = self.transformer(stacked_X)  # (batch, 2) logits, one column per class
        #labels = torch.tensor(labels, dtype=torch.float32)
        labels = torch.tensor(labels)
        #m = nn.Sigmoid()

        with torch.cuda.amp.autocast():
            print(out[:, 1] - out[:, 0])  # debug: logit margin between the two classes
            #loss = self.criterion(m(out[:,1]-out[:,0]), labels.cuda())
            loss = self.criterion(out, labels.cuda())

        #pred = out.data.max(1)[1]
        pred_labels = out.argmax(dim=1)  # predicted class index per sample
        labels = labels.int()
        return pred_labels, labels, loss

evaluator.get_scores 0.3194444444444444
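
Since the split is roughly 30% class 0 and 70% class 1, one variant I have been considering (but have not verified) is passing class weights to CrossEntropyLoss. A rough sketch of what I mean; the weight values below are just inverse class frequencies derived from that 30/70 split, not tuned numbers:

import torch
import torch.nn as nn

# Inverse-frequency weights for a 30/70 split: the minority class (0) gets the larger weight.
class_weights = torch.tensor([1.0 / 0.3, 1.0 / 0.7], dtype=torch.float32)
class_weights = class_weights / class_weights.sum()  # optional normalization

criterion = nn.CrossEntropyLoss(weight=class_weights.cuda())

# out: (batch, 2) logits from the transformer, labels: (batch,) int64 class indices
# loss = criterion(out, labels.cuda())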

For calculating accuracy, I am using this code snippet:

import numpy as np


class ConfusionMatrix(object):

    def __init__(self, n_classes):
        self.n_classes = n_classes
        # axis = 0: prediction
        # axis = 1: target
        self.confusion_matrix = np.zeros((n_classes, n_classes))

    def _fast_hist(self, label_true, label_pred, n_class):
        hist = np.zeros((n_class, n_class))
        hist[label_pred, label_true] += 1

        return hist

    def update(self, label_trues, label_preds):
        for lt, lp in zip(label_trues, label_preds):
            tmp = self._fast_hist(lt.item(), lp.item(), self.n_classes)    #lt.item(), lp.item()
            self.confusion_matrix += tmp

    def get_scores(self):
        """Returns the overall accuracy computed from the confusion matrix."""
        hist = self.confusion_matrix
        # overall accuracy: correctly classified samples (diagonal) / total samples

        if sum(hist.sum(axis=1)) != 0:
            acc = sum(np.diag(hist)) / sum(hist.sum(axis=1))
            print('acc is: ', acc)
        else:
            acc = 0.0

        return acc

    def plotcm(self):
        print(self.confusion_matrix)

    def reset(self):
        self.confusion_matrix = np.zeros((self.n_classes, self.n_classes))
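
As a standalone sanity check of the class above, this is how I would expect it to behave on made-up labels (3 of the 4 predictions match, so the score should be 0.75):

import torch

metrics = ConfusionMatrix(n_classes=2)

# Made-up targets and predictions, purely for illustration.
trues = torch.tensor([0, 1, 1, 0])
preds = torch.tensor([0, 1, 0, 0])

metrics.update(trues, preds)
print(metrics.get_scores())   # expected: 0.75
metrics.plotcm()              # prints the raw 2x2 matrix (rows: prediction, cols: target)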

and during the test/validation phase, which runs every epoch, I am using this:

if epoch % 1 == 0:
    with torch.no_grad():
        model.eval()
        print("evaluating...")

        total = 0.
        batch_idx = 0
        val_epoch_loss = 0.
        val_preds = []
        val_labels = []
        predictions = []
        actuals = []
        for i_batch, sample_batched in enumerate(dataloader_val):
            val_pred, val_label, val_loss = evaluator.eval_test(sample_batched, model)
            val_epoch_loss += val_loss
            val_preds.extend(val_pred.tolist())
            val_labels.extend(val_label)
            total += len(val_label)
            evaluator.metrics.update(torch.tensor(val_label).cuda(), val_pred)
        print('evaluator.get_scores', evaluator.get_scores())
    

Here is what out from the transformer looks like:

transformer out:  tensor([[ 0.4381, -0.6186],
        [ 0.4252, -0.4492],
        [ 1.0657, -0.5201],
        [ 0.8421, -0.6315],
        [ 0.9444, -0.5340],
        [ 0.9247, -0.6726],
        [ 1.1587, -0.9463],
        [ 1.0038, -1.0780],
        [ 1.4244, -1.0721],
        [ 0.4215, -0.7684],
        [ 0.7522, -0.8166],
        [ 1.2995, -0.9579],
        [ 0.8080, -0.6492],
        [ 1.0144, -0.5562],
        [ 1.0666, -1.0291],
        [ 0.3030, -0.7651],
        [ 0.5221, -0.6741],
        [ 1.1583, -0.4493],
        [ 0.6098, -1.0080],
        [ 0.3495, -1.0742],
        [ 0.2278, -0.7298],
        [ 0.5189, -0.6456],
        [ 0.3409, -0.3661],
        [ 0.9637, -0.9262],
        [ 1.0781, -0.9345],
        [ 1.0993, -1.0937],
        [ 0.8297, -0.6071],
        [ 0.5423, -1.1961],
        [ 0.7860, -0.6777],
        [-0.2522, -0.9376],
        [ 0.6013, -0.9057],
        [ 0.9975, -0.1858]], device='cuda:0', grad_fn=<AddmmBackward0>)
labels:  tensor([1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1,
        0, 0, 1, 1, 0, 1, 1, 0], dtype=torch.int32)
pred labels:  tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')

Is there a training part?
And it is strange that you have two outputs for binary classification, as though there were two separate classes.

There is an mlp_head in the transformer, which is why it returns two values, one for each of the classes:

        self.mlp_head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_classes) 
        )

Yes, the tensor I showed at the end of the post is from training.

I am wondering why the first element of the transformer output always happens to be larger than the second element. What are some reasons for this?

transformer out:  tensor([[ 0.2996, -0.0972],
        [ 0.2175, -0.2273],
        [ 0.2899, -0.0668],
        [ 0.3128, -0.1157],
        [-0.1371, -0.3733],
        [ 0.2676, -0.1176],
        [ 0.3138, -0.1216],
        [ 0.3049, -0.1100],
        [ 0.2373, -0.2062],
        [-0.0232, -0.3582],
        [-0.1412, -0.3851],
        [-0.0985, -0.4127],
        [ 0.3014, -0.1182],
        [ 0.2232, -0.1620],
        [ 0.2547, -0.1686],
        [ 0.2677, -0.1570],
        [-0.3329, -0.4284],
        [ 0.1175, -0.2559],
        [ 0.3286, -0.2217],
        [ 0.3033, -0.0996],
        [ 0.3201, -0.1559],
        [ 0.1466, -0.2372],
        [ 0.3162, -0.1330],
        [ 0.3052, -0.1036],
        [ 0.3127, -0.1366],
        [ 0.3376, -0.2449],
        [ 0.3011, -0.0993],
        [ 0.3058, -0.0974],
        [ 0.3293, -0.1865]], device='cuda:0')

The reason the second value is always less than the first one (if everything else in the training loop is correct) is that the model always predicts the first class over the second. Something is amiss here, such as the correctness of the labels.
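
One quick way to check that is to count how often each class actually comes out of your dataloader. A rough sketch, assuming the batches unpack to (X, labels) the same way your forward method expects (dataloader_train is a placeholder; adapt the unpacking to your actual sample_batched format):

from collections import Counter

# Count class frequencies over one pass of the training dataloader.
label_counts = Counter()
for X, labels in dataloader_train:
    label_counts.update(int(l) for l in labels)
print(label_counts)  # should be roughly 70% class 1 / 30% class 0 if the labels are wired up correctly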

Also, you don't really need two outputs if each sample belongs to exactly one of the two classes, never both simultaneously. You can use BCELoss (or BCEWithLogitsLoss) with a single output, where 0 means class no. 1 and 1 means class no. 2.
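
A minimal sketch of that single-output setup, assuming the same dim your mlp_head already uses; thresholding the logit at 0 is equivalent to thresholding the sigmoid at 0.5:

import torch
import torch.nn as nn

dim = 512  # assumed; use whatever dim the mlp_head above is built with

# Single-logit head: one output per sample instead of one column per class.
mlp_head = nn.Sequential(
    nn.LayerNorm(dim),
    nn.Linear(dim, 1)
)

criterion = nn.BCEWithLogitsLoss()  # applies the sigmoid internally, so feed it raw logits

# out: (batch, 1) logits, labels: (batch,) targets in {0., 1.}
# loss = criterion(out.squeeze(1), labels.float().cuda())
# pred_labels = (out.squeeze(1) > 0).long()   # same as sigmoid(out) > 0.5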