CNN model seems to be training properly, but returns all 0s when testing some of the time?

I am a novice at PyTorch, currently working on a school project that involves replicating a paper. I am creating a CNN to perform binary classification on medical documents (which I cannot share).

First, I activate CUDA, if possible:

device = (
    "cuda"
    if torch.cuda.is_available()
    else "cpu"
)
print(f"Using {device} device")

I load my data like so.

study_corpus_tensor = torch.load("embedded_docs.pt")
study_corpus_tensor = study_corpus_tensor.to(device)

study_corpus_tensor is of size (number of documents, length of longest document, word embedding size). study_corpus_tensor[i, :, :] represents the i-th document. study_corpus_tensor[i, j, :] contains a word embedding (created using Word2Vec) for the j-th word in the i-th document.
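
For concreteness, here is a dummy tensor with the same layout (the sizes below are placeholders only, not the real corpus dimensions):

n_docs, max_doc_len, emb_dim = 1000, 500, 100  # placeholder sizes only
dummy_corpus = torch.randn(n_docs, max_doc_len, emb_dim)
print(dummy_corpus[0].shape)     # one document: (max_doc_len, emb_dim)
print(dummy_corpus[0, 0].shape)  # one word embedding: (emb_dim,)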

This is my dataset class:

class CustomDatasetEmbedded(Dataset):
    def __init__(self, corpus_tensor, labels):
        self.x = corpus_tensor
        self.y = labels

    def __len__(self):
        return len(self.y)

    def __getitem__(self, index):
        return (self.x[index, :, :], self.y[index])

I essentially pass in study_corpus_tensor to be stored as x and a column from a dataframe (1 if a patient had depression, 0 if they did not) to be stored as y.

depression_y = torch.tensor(labelled_corpus_df["Depression"]).to(device)
depression_dataset = CustomDatasetEmbedded(study_corpus_tensor, depression_y)
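
A quick sanity check (purely illustrative) that the dataset returns the expected shapes:

x0, y0 = depression_dataset[0]
print(x0.shape)  # (length of longest document, embedding size)
print(y0)        # 0 or 1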

Here, I do a train test split. I do not have a validation set, as I am currently using the hyperparameters from the study.

train_dataset, test_dataset = torch.utils.data.random_split(depression_dataset, [0.8, 0.2])
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size = 32, shuffle = True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size = 32)
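
To make the split reproducible between runs, random_split also accepts a generator; a minimal sketch (the seed value here is arbitrary, not taken from the project):

g = torch.Generator().manual_seed(0)  # arbitrary seed
train_dataset, test_dataset = torch.utils.data.random_split(depression_dataset, [0.8, 0.2], generator = g)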

Here is my model. It is based on one of the models described in a paper:

class CNN_1_gram(nn.Module):
    def __init__(self):
        super(CNN_1_gram, self).__init__()
        self.conv1 = nn.Conv2d(in_channels = 1,
                               out_channels = 100,
                               kernel_size = (1, embedding_vector_size),
                               stride = 1,
                               padding = 0)
        
        # conv output height = doc length - kernel height + 1 = doc length (kernel height is 1),
        # so pooling over the full height condenses each filter's feature map to a single value
        conv1_output_height = study_corpus_tensor.shape[1] + 1 - 1
        self.pool1 = nn.MaxPool2d(kernel_size = (conv1_output_height, 1))

        self.do = nn.Dropout(p = 0.5) 

        self.fc = nn.Linear(100, 2)  # input size is 100: one pooled value per filter

        self.activation = nn.LogSoftmax(dim = 1)
        
    def forward(self, x):
        # Provided Lua code (last_layer is probably output of pooling):
        # local output = nn.LogSoftMax()(linear(nn.Dropout(opt.dropout_p)(last_layer)))

        x = torch.unsqueeze(x, dim = 1)

        x1 = self.conv1(x)

        x1 = torch.relu(x1) # Point of this, given the global max pooling?

        x = self.pool1(x1)

        x = self.do(x)

        x = torch.flatten(x, start_dim = 1)
        
        x = self.fc(x)

        x = self.activation(x)

        return x

I instantiate the model:

cnn_1_gram_model = CNN_1_gram().to(device)
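
A dummy forward pass can be used to sanity-check the output shape and confirm that each row is a set of log-probabilities (the batch size of 4 below is arbitrary):

dummy_batch = torch.randn(4, study_corpus_tensor.shape[1], embedding_vector_size, device = device)
out = cnn_1_gram_model(dummy_batch)
print(out.shape)               # expected: torch.Size([4, 2])
print(out.exp().sum(dim = 1))  # each row should sum to ~1 (log-probabilities)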

I create the optimizer and loss function, as described in the paper:

criterion = nn.modules.loss.NLLLoss()
optimizer = torch.optim.Adadelta(cnn_1_gram_model.parameters(), rho = 0.95, eps = 1e-6)
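
For reference, NLLLoss applied to LogSoftmax outputs is equivalent to CrossEntropyLoss applied to raw logits, which is relevant to the logits-vs-log-probabilities discussion further down; a quick check with dummy values:

dummy_logits = torch.randn(4, 2)
dummy_targets = torch.tensor([0, 1, 1, 0])
nll = nn.NLLLoss()(nn.LogSoftmax(dim = 1)(dummy_logits), dummy_targets)
ce = nn.CrossEntropyLoss()(dummy_logits, dummy_targets)
print(torch.allclose(nll, ce))  # True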

I create a training function:

n_epochs = 20

def train_model(model, train_dataloader, n_epoch, optimizer, criterion):
    model.train() 
    
    for epoch in range(n_epoch):
        curr_epoch_loss = []

        for x, y in tqdm.tqdm(train_dataloader):
            optimizer.zero_grad()

            y_hat = model(x)
            loss = criterion(y_hat, y)

            loss.backward()

            optimizer.step()

            curr_epoch_loss.append(loss.detach().cpu().numpy())

        print(f"Epoch {epoch}: curr_epoch_loss={np.mean(curr_epoch_loss)}")
    return model

I train the model (this is really fast):

cnn_1_gram_model = train_model(model = cnn_1_gram_model,
                               train_dataloader = train_loader,
                               n_epoch = n_epochs,
                               optimizer = optimizer,
                               criterion = criterion)

I create a function to evaluate the model:

def eval_model(model, dataloader):
    model.eval()
    Y_pred  = []
    Y_true  = []
    Y_score = []

    with torch.no_grad():
        for x, y in dataloader:
            Y_true.append(y)
            
            y_hat = model(x)
            
            Y_score.append(y_hat[:, 1])

            # Return class with higher probability
            Y_pred.append(torch.max(y_hat, 1).indices)
            
    Y_score = [y_score.to("cpu") for y_score in Y_score]
    Y_pred  = [y_pred.to("cpu")  for y_pred  in Y_pred]
    Y_true  = [y_true.to("cpu")  for y_true  in Y_true]

    Y_score = np.concatenate(Y_score, axis = 0)    
    Y_pred  = np.concatenate(Y_pred,  axis=0)
    Y_true  = np.concatenate(Y_true,  axis=0)

    return Y_score, Y_pred, Y_true

I evaluate the model:

y_score, y_pred, y_true = eval_model(cnn_1_gram_model, test_loader)

I print some metrics:

print("Predicted percent of patients that have the condition:", np.sum(y_pred) / len(y_pred))
print("Actual percent of patients that have the condition:", np.sum(y_true) / len(y_true))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1 Score:", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))

My results are extremely bizarre. About half the time, I get fairly good results like those seen below:

Predicted percent of patients that have the condition: 0.2462686567164179
Actual percent of patients that have the condition: 0.2947761194029851
Accuracy: 0.8917910447761194
Precision: 0.8787878787878788
Recall: 0.7341772151898734
F1 Score: 0.8
AUC: 0.9315518049695265

The other half of the time, I get abysmal results, with my model predicting class 0 for every single test example:

Predicted percent of patients that have the condition: 0.0
Actual percent of patients that have the condition: 0.26492537313432835
Accuracy: 0.7350746268656716
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
AUC: 0.8657324658611568

/usr/local/lib/python3.9/dist-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Still other times, I get nearly all 0s, with a precision of 1 and very low recall (indicating that the threshold to mark an example as positive seems to be extremely high).

I thought that differences in the division of data might be causing this issue, so I did:

for x, y in train_loader:
    print(torch.sum(y))

and

for x, y in test_loader:
    print(torch.sum(y))

but even when 0 was being output for every test example, every batch in both the training and test loaders contained at least one positive instance.

I found that when 0s are returned for every test instance, class 0 does not have the highest probability for every training instance during the last training epoch. Combined with the fact that the training loss keeps decreasing as more epochs complete, this suggests that correct predictions are often being made during training. However, if I feed the training data into the evaluation function, it still returns 0 for all or nearly all of the training instances, which makes me think overfitting isn't the problem (if the model were overfitting, performance on the training data should be good).
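
One way to separate "the model never learns" from "something changes between training and evaluation" would be a per-epoch accuracy check on the training set with the model switched to eval() mode. A sketch under that idea (training_set_accuracy is a hypothetical helper, not code from the project):

def training_set_accuracy(model, loader):
    # diagnostic only: accuracy on a loader with dropout disabled (eval mode)
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            pred = model(x).argmax(dim = 1)
            correct += (pred == y).sum().item()
            total += y.numel()
    model.train()
    return correct / total

# e.g. at the end of each epoch inside train_model:
# print(f"train accuracy (eval mode): {training_set_accuracy(model, train_dataloader):.3f}")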

This happens no matter how many epochs I use (1, 5, or 20), and even when I change the loss function and optimizer. It happened both before and after I added GPU acceleration.

The authors of the study wrote that "After every parameter update, the parameters of the feature maps were normalized to a norm of 3." I talked with a TA about this, and he said it seemed strange and that I could ignore it. Did skipping this cause a problem? I'm also not sure whether 3 is a sensible norm for my setup, since I may be using a different embedding size from the authors (they did not report the size they used). How would you implement this, anyway?
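
If the constraint were implemented, it would typically be done as a max-norm on each filter, rescaling any filter whose L2 norm exceeds 3 immediately after optimizer.step(). A sketch under that assumption (renorm_conv_filters is a hypothetical helper; whether the paper means "at most 3" or "exactly 3" is unclear):

def renorm_conv_filters(conv, max_norm = 3.0):
    # rescale each filter (one row of the flattened weight) whose L2 norm exceeds max_norm
    with torch.no_grad():
        w = conv.weight.view(conv.weight.size(0), -1)  # (out_channels, everything else)
        norms = w.norm(p = 2, dim = 1, keepdim = True)
        w.mul_((max_norm / norms).clamp(max = 1.0))

# called right after optimizer.step() in the training loop:
# renorm_conv_filters(cnn_1_gram_model.conv1)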

What might be going on? Did I make a stupid mistake somewhere? This problem seems extremely bizarre. The only thing that should be changing between runs is the division of data into training and test sets, and that should not lead to such large and confusing differences. Is something wrong with my eval_model() function?

This line of code:

Y_score.append(y_hat[:, 1])

looks wrong, as you are only using the logits of class 1, which might create an invalid AUC score. However, this would not explain the low accuracy results.

I assume you've checked the model's logits for some samples of the training and test sets? Did you see any trend, or values close to each other, for the test set?

For the first point, why is using only the logits for class 1 wrong? Should I exponentiate them? It does not seem to make a difference. I don't think you need to provide scores for both class 0 and class 1 for binary classification.

Here is a boxplot of the exponentiated test logits for class 1 with seed = 3 (which yields good results):

[boxplot: exponentiated class-1 test outputs, seed = 3]

As you can see, the median is low. Most observations are put in class 0, but a decent number still end up in class 1, as expected.

Here is a boxplot of the exponentiated test logits for class 1 with seed = 0 (which yields bad results):

[boxplot: exponentiated class-1 test outputs, seed = 0]

As you can see, all of the probabilities are below 0.5, so everything gets assigned to class 0, which is incorrect.

I did not notice much difference in the logits observed while training.

Does this indicate anything to you?

You are mixing up multi-class and multi-label classification; you should not process the logits in isolation, since the probability is calculated via the softmax operation and thus uses all of the logits:

output = torch.tensor([[5.4, 5.2]])
prob = output.softmax(1)
print(prob)
# tensor([[0.5498, 0.4502]])

The actual value of 5.2 is “high” but doesn’t indicate the predicted class.

What is wrong with my understanding of what is going on?

The values I plotted in the boxplots are the outputs of model(data) with torch.exp() applied. model(data) returns a tensor of size (batch size, 2), where each row contains the LogSoftmax of the fully connected layer's outputs. After torch.exp(), each row sums to 1. Why would I call softmax on this?

This would make sense now; it was unclear that you were working with log-probabilities while mentioning logits, which is why I pointed out that the usage would be wrong.
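
For completeness, the two views are consistent: exponentiating the LogSoftmax output recovers exactly the probabilities that softmax would give on the raw logits (dummy values for illustration):

logits = torch.tensor([[5.4, 5.2]])
print(torch.exp(torch.log_softmax(logits, dim = 1)))  # tensor([[0.5498, 0.4502]])
print(torch.softmax(logits, dim = 1))                 # identical values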