Low loss but undesirable F-1 score! Binary classifier

Hi Community,

Thanks to the posts within this community, I have the following code, which takes numerical inputs of shape 1 x 6156 (values in the range 0 to 1) and classifies them into 2 classes [0 or 1]. With a 10-layer network I was able to get to a low loss (0.000089), but the test data gives 60% on the F-1 score. The previous architecture had a loss of 0.002 with an F-1 score of 68%. It is counterintuitive that the lower-loss architecture did worse. Other posts say it could be something in the code. Is it possible I am doing something wrong in the testing or training?

Hope this code can also help others who don’t deal with images!

Side note: in the forward pass, having the sigmoid only after the first layer gave the lowest loss. I tried all other variations and they made training worse.

import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader

X_train, X_test, y_train, y_test = train_test_split(Concatenated_x, Concatenated_y, random_state=28, test_size=0.3151)

class Data(Dataset):
    def __init__(self, X_train, y_train):
        self.X = torch.from_numpy(X_train.astype(np.float32))
        self.y = torch.from_numpy(y_train).type(torch.LongTensor)
        self.len = self.X.shape[0]
    def __getitem__(self, index):
        return self.X[index], self.y[index]
    def __len__(self):
        return self.len

traindata = Data(np.asarray(X_train), np.asarray(y_train))

trainloader = DataLoader(traindata, batch_size=batch_size,shuffle=True, num_workers=0)
import torch.nn as nn
input_dim = 6156 #
hidden_layers = 12312 
hidden_layers2 = 6156 
hidden_layers3 = 3078 
hidden_layers4 = 1539 
hidden_layers5 = 770 
hidden_layers6 = 385 
hidden_layers7 = 192 
hidden_layers8 = 96 
hidden_layers9 = 48 
hidden_layers10 = 24 
output_dim = 2

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.linear1 = nn.Linear(input_dim, hidden_layers)
        self.linear2 = nn.Linear(hidden_layers, hidden_layers2)
        self.linear3 = nn.Linear(hidden_layers2, hidden_layers3)
        self.linear4 = nn.Linear(hidden_layers3, hidden_layers4)
        self.linear5 = nn.Linear(hidden_layers4, hidden_layers5)
        self.linear6 = nn.Linear(hidden_layers5, hidden_layers6)
        self.linear7 = nn.Linear(hidden_layers6, hidden_layers7)
        self.linear8 = nn.Linear(hidden_layers7, hidden_layers8)
        self.linear9 = nn.Linear(hidden_layers8, hidden_layers9)
        self.linear10 = nn.Linear(hidden_layers9, hidden_layers10)
        self.linear11 = nn.Linear(hidden_layers10, output_dim )

    def forward(self, x):
        x = torch.sigmoid(self.linear1(x)) #original
        x = (self.linear2(x))
        x = (self.linear3(x))
        x = (self.linear4(x))
        x = (self.linear5(x))
        x = (self.linear6(x))
        x = (self.linear7(x))
        x = (self.linear8(x))
        x = (self.linear9(x))
        x = (self.linear10(x))
        x = (self.linear11(x))
        return x
valid_loss_min = np.inf  # np.Inf was removed in NumPy 2.0
path = "Model_1.pth"

# clf, criterion, optimizer, device and epochs are assumed to be defined earlier (not shown in this post)
for epoch in range(epochs):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data[0].to(device), data[1].to(device)
        # Clear the gradients
        optimizer.zero_grad()
        # Forward Pass
        outputs = clf(inputs)
        labels = labels.view(-1)  # flatten to shape [batch], also safe when the batch holds a single sample
        # Find the Loss
        loss = criterion(outputs, labels)
        # Calculate Gradients
        loss.backward()
        # Update Weights
        optimizer.step()
        # Accumulate the loss, weighted by the batch size
        running_loss += loss.item() * inputs.size(0)
    # Average over the whole epoch instead of reporting only the last batch
    epoch_loss = running_loss / len(traindata)
    print('epoch {}, loss {:.6f}'.format(epoch, epoch_loss))
    # Save the model if the loss has decreased.
    # Note: this is the training loss; no validation set is evaluated in this loop.
    if epoch_loss <= valid_loss_min:
        print('Training loss decreased ({:.6f} --> {:.6f}). Saving model..'.format(valid_loss_min, epoch_loss))
        torch.save(clf.state_dict(), path)
        valid_loss_min = epoch_loss  # remember the best loss so far
        print('Model Saved')

Now for the testing code, it’s as follows:

clf.eval() #  set the dropout and batch normalization layers to evaluation mode

testdata = Data(np.asarray(X_test), np.asarray(y_test))
testloader = DataLoader(testdata, batch_size=batch_size, 
                        shuffle=False, num_workers=0)

dataiter = iter(testloader)
inputs, labels = next(dataiter)  # the built-in next() works on any iterator

outputs = clf(inputs.to(device)) # The outputs are energies for the 2 classes; the higher the energy, the more the network thinks the input belongs to that class
__, predicted = torch.max(outputs, 1) # get the index of the highest energy

nb_classes = 2
from sklearn.metrics import precision_recall_fscore_support as score
confusion_matrix = torch.zeros(nb_classes, nb_classes)
with torch.no_grad():
    for i, (inputs, classes) in enumerate(testloader):
        inputs = inputs.to(device)  # same device as the model
        outputs = clf(inputs)
        _, preds = torch.max(outputs, 1)
        for t, p in zip(classes.view(-1), preds.cpu().view(-1)):
            confusion_matrix[t.long(), p.long()] += 1

# Compute the metrics once, after every test batch has been counted
cm = confusion_matrix.numpy()  # rows: true class, columns: predicted class
recall = np.diag(cm) / np.sum(cm, axis=1)
precision = np.diag(cm) / np.sum(cm, axis=0)
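
The test code above stops at precision and recall. As a minimal sketch, assuming the same layout (rows are true classes, columns are predicted classes), per-class F1 can be derived from a confusion matrix as follows. The helper name `per_class_f1` and the matrix values are hypothetical and only for illustration; they are not results from this thread.

```python
import numpy as np

def per_class_f1(cm):
    """Return per-class precision, recall and F1 from a confusion matrix
    whose rows are true classes and whose columns are predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    true_totals = cm.sum(axis=1)   # samples that really belong to each class
    pred_totals = cm.sum(axis=0)   # samples predicted as each class
    # Guard against division by zero when a class is never present or never predicted
    recall = np.divide(tp, true_totals, out=np.zeros_like(tp), where=true_totals > 0)
    precision = np.divide(tp, pred_totals, out=np.zeros_like(tp), where=pred_totals > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Hypothetical imbalanced result: 980 negatives, 20 positives
cm = [[950, 30],
      [8, 12]]
precision, recall, f1 = per_class_f1(cm)
print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
```

In this made-up matrix the majority class scores well while the minority class does not, which is the situation the overall accuracy tends to hide.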

It seems you are dealing with an imbalanced dataset, so the loss and accuracy don’t reflect the metric you care about (f1_score).
Here is a small demonstration of the accuracy paradox, showing that the accuracy can indeed be misleading:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# setup using imbalanced use case
N = 1000
y_true = np.zeros(N)
y_true[-2:] = 1.
y_pred = np.zeros(N)
y_pred[-2:] = 1.

# perfect results
print(accuracy_score(y_true, y_pred))
# 1.0
print(f1_score(y_true, y_pred))
# 1.0

# 2/1000 mismatches
y_pred = np.zeros(N)
print(accuracy_score(y_true, y_pred))
# 0.998
print(f1_score(y_true, y_pred))
# 0.0

So you might want to either balance the samples using a WeightedRandomSampler or use loss weighting.
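
For the loss weighting option, a minimal sketch could pass per-class weights to `nn.CrossEntropyLoss` through its `weight` argument. The class counts are the ones reported later in this thread; the inverse-frequency formula (`n_samples / (n_classes * count)`) is one common heuristic and an assumption here, not something the thread prescribes. The logits and targets are dummy values.

```python
import torch
import torch.nn as nn

# Class counts reported later in this thread (index 0 = majority, index 1 = minority)
class_counts = torch.tensor([121087.0, 2817.0])

# Inverse-frequency weights: n_samples / (n_classes * count_per_class).
# This heuristic is an assumption; other weightings are possible.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

# Mistakes on the minority class now contribute more to the loss
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Dummy batch: 4 samples with 2 logits each, and their integer class labels
logits = torch.tensor([[2.0, -1.0], [0.5, 0.3], [-1.0, 2.0], [1.5, -0.5]])
targets = torch.tensor([0, 0, 1, 1])
loss = criterion(logits, targets)
print(class_weights, loss.item())
```

If the model runs on a GPU, the weight tensor would need to be moved to the same device before the criterion is used.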

Thank you for the reply, ptrblck. It makes sense that the loss is disproportionate because of the imbalanced data. For some reason Matlab gives me an 88% F1 with the same setup and imbalance, and I don’t understand why I can’t get the same results here. The test data will also have this imbalance…

But I believe there is also something else happening since after calculating F1-Score, which shouldn’t be affected by the imbalanced data, I get the following results:

Loss of 0.000376 gives 68% F1
Loss of 0.000089 gives 60% F1. Shouldn’t the confusion matrix still show better results (fewer FN and FP)?

Is this still explained by the accuracy paradox?

Side question, is it okay to only have the sigmoid activation in the first layer? I can’t get a definite answer. In this code it seems to give the best results for some reason…

Thank you for always replying!

Yes, I think that can still be the case, and you could verify it manually by using the confusion matrix and comparing different metrics to the loss calculation.
Depending on the criterion used, the outputs also represent how “sure” the model is about a prediction. A wrong prediction made with high confidence would of course yield a larger loss, while the accuracy could stay the same.
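
To illustrate the point about confidence, here is a small sketch with made-up probabilities for two hypothetical models on the same ten samples. Model A is right everywhere but only moderately confident; model B is very confident and misses one positive. The helper names and all numbers are assumptions chosen for illustration, not values from this thread.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pos):
    """Mean cross-entropy given the predicted probability of class 1."""
    p_pos = np.clip(p_pos, 1e-12, 1 - 1e-12)
    return float(np.mean(-(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))))

def f1_positive(y_true, p_pos, threshold=0.5):
    """F1 of the positive class after thresholding the probabilities."""
    y_pred = (p_pos >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

# 8 negatives followed by 2 positives
y_true = np.array([0] * 8 + [1] * 2)

# Model A: every prediction is correct, but with moderate confidence
p_a = np.array([0.3] * 8 + [0.6, 0.6])
# Model B: very confident overall, but one positive falls below the threshold
p_b = np.array([0.001] * 8 + [0.999, 0.4])

loss_a, f1_a = binary_cross_entropy(y_true, p_a), f1_positive(y_true, p_a)
loss_b, f1_b = binary_cross_entropy(y_true, p_b), f1_positive(y_true, p_b)
print("model A: loss {:.4f}, F1 {:.3f}".format(loss_a, f1_a))
print("model B: loss {:.4f}, F1 {:.3f}".format(loss_b, f1_b))
```

Model B ends up with the lower loss and the lower F1, because the many confidently correct negatives dominate the average loss while the single missed positive halves the recall.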

I don’t know and haven’t seen a lot of sigmoid activations in common models, but if this activation function works fine for your use case it sounds like a reasonable choice.
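
One property that may be relevant to the side question, and that can be checked directly: linear layers stacked without an activation between them compose into a single linear map. The sketch below uses toy layer sizes chosen only for illustration (they are not the sizes from the posted network) and folds two stacked `nn.Linear` layers into one equivalent layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between (toy sizes)
first = nn.Linear(4, 8).double()
second = nn.Linear(8, 3).double()

# Fold them into one layer: W = W2 @ W1 and b = W2 @ b1 + b2
merged = nn.Linear(4, 3).double()
with torch.no_grad():
    merged.weight.copy_(second.weight @ first.weight)
    merged.bias.copy_(second.weight @ first.bias + second.bias)

# Compare both versions on random inputs
x = torch.randn(5, 4, dtype=torch.double)
with torch.no_grad():
    stacked_out = second(first(x))
    merged_out = merged(x)

same = torch.allclose(stacked_out, merged_out)
print(same)
```

Under this property, the layers after the sigmoid in the posted network can represent no more than one linear layer could, although the deeper parameterization may still train differently in practice.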

I balanced the data (1:1) and it gave me great results on training, but on actual data, which is unbalanced, the results are 20% (F1-score).

I tried using the WeightedRandomSampler, is this the right implementation?

from torch.utils.data import WeightedRandomSampler

labels = traindata.y.numpy().reshape(-1).astype(int) # read the labels straight from the dataset tensor instead of converting the whole dataset to an array

num_of_majority_class_training_examples = 121087 # Hard code number of 0's
num_of_minority_class_training_examples = 2817 # Hard code number of 1's

majority_weight = 1/num_of_majority_class_training_examples

minority_weight = 1/num_of_minority_class_training_examples

sample_weights = np.array([majority_weight, minority_weight]) # This is assuming that your minority class is the integer 1 in the labels object. If not, switch places so it's minority_weight, majority_weight.

weights = sample_weights[labels] # this goes through each training example and uses the labels 0 and 1 as the index in sample_weights object which is the weight you want for that class.

sampler = WeightedRandomSampler(weights=weights, num_samples=len(traindata), replacement=True)

trainloader = DataLoader(traindata, batch_size=batch_size, sampler=sampler)
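
One way to check that a sampler like the one above actually balances the batches is to count the labels it draws. The following is a minimal sketch on a toy dataset; the sizes (980 zeros, 20 ones), the dummy features and the batch size are assumptions for illustration, and the class counts are computed with `np.bincount` rather than hard coded.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy imbalanced dataset: 980 samples of class 0 and 20 samples of class 1
labels = np.array([0] * 980 + [1] * 20)
features = torch.zeros(len(labels), 3)  # dummy inputs, only the labels matter here
dataset = TensorDataset(features, torch.from_numpy(labels))

# One weight per sample: the inverse of its class frequency
class_counts = np.bincount(labels)
weights = (1.0 / class_counts)[labels]

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(weights, dtype=torch.double),
    num_samples=len(dataset),
    replacement=True,
)
loader = DataLoader(dataset, batch_size=100, sampler=sampler)

# Collect every label drawn during one pass and measure the class mix
drawn = torch.cat([y for _, y in loader])
minority_fraction = (drawn == 1).float().mean().item()
print("fraction of class 1 in sampled batches:", minority_fraction)
```

With these weights the fraction should land near one half, even though class 1 makes up only 2% of the toy dataset. The sampler changes only what the model sees during training; an unbalanced test set is still evaluated as-is.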

The test set results are still not great. Am I coding something wrong?

The balancing looks correct and the training results seem also to reflect it.
Is the validation set still overfitting to the majority class or what does the confusion matrix show?

The validation results are odd, but the confusion matrix for the new dataset is okay. It would be ideal if I could improve on these results…

This is the training

This is the test


New Data