Model Predictions are all Tensors Full of Zeros

Hi all,

I want to preface this by saying I’m relatively new to this field, so I apologize in advance if the solution is trivial. I’m building a model that takes general information about the weather and location of an accident and predicts the severity of the traffic caused by the crash. The dataset I’m using is from Kaggle (link to dataset).

In particular, I’m attempting to use the fields (“Street”, “City”, “County”, “State”, “Zipcode”, “Temperature(F)”, “Wind_Chill(F)”, “Visibility(mi)”, “Wind_Speed(mph)”, “Precipitation(in)”, and “Weather_Condition”) as the inputs (features) to the model to predict a severity value that ranges from 1 to 4 (labels).

All of the string fields (Street, City, County, State, Zipcode, and Weather_Condition) have been converted to pandas category types and encoded via “cat.codes”. All of the numerical values, with the exception of severity, have been converted to “np.float32” values. The severity is maintained as an integer value.
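
For reference, that preprocessing looks roughly like the following (a sketch rather than my exact code; “df” is assumed to be the dataframe loaded from the Kaggle CSV):

import numpy as np

cat_cols = ["Street", "City", "County", "State", "Zipcode", "Weather_Condition"]
num_cols = ["Temperature(F)", "Wind_Chill(F)", "Visibility(mi)",
            "Wind_Speed(mph)", "Precipitation(in)"]

# Encode the string columns as integer category codes
for col in cat_cols:
    df[col] = df[col].astype("category").cat.codes

# Cast the numerical columns to float32; Severity stays an integer
df[num_cols] = df[num_cols].astype(np.float32)

X = df[cat_cols + num_cols].values.astype(np.float32)
y = df["Severity"].values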

I split the dataset into train/test sets via sklearn and run the following commands to convert the sets into something iterable (DataLoaders):

# Convert the training/test arrays from numpy.ndarray to tensors.
X_train = torch.from_numpy(X_train).float()
y_train = torch.from_numpy(y_train).float()

X_test = torch.from_numpy(X_test).float()
y_test = torch.from_numpy(y_test).float()

# Make the tensor data iterable for model training.
train = torch.utils.data.TensorDataset(X_train, y_train)
test = torch.utils.data.TensorDataset(X_test, y_test)

train_loader = torch.utils.data.DataLoader(train, batch_size=args.batch_size, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(test, batch_size=args.test_batch_size, shuffle=True, **kwargs)

For my NN class definition, I do the following:

class ANN(nn.Module):
    def __init__(self, input_dim = 11, output_dim = 4):
        super(ANN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 32)
        self.output_layer = nn.Linear(32,output_dim)
        self.dropout = nn.Dropout(0.15)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = F.relu(self.fc3(x))
        x = F.relu(self.fc4(x))
        x = self.output_layer(x)    
        return F.log_softmax(x, dim=1)

The first layer, fc1, has shape (11, 64) because 11 features/attributes are passed as inputs to the model. My goal is for the model to predict a distinct severity level (so a prediction should simply be 1, 2, 3, or 4, since severity levels are mutually exclusive). I therefore thought the output layer needed to be (32, 4) to cover the four different severity values. However, when I attempt to train the model with this output layer size, I get a NaN loss and the following warning:

With the error in mind, I attempted to change the output layer size to (32, 1), despite being unsure how this works, and I was able to get a non-NaN loss value (hovering around ~4.x) but noticed that my accuracy was 0%. Upon further investigation, I found that my model predictions, i.e. the “model(data)” value, are a tensor full of zeros:
[screenshot: model(data) output, a tensor of all zeros]

when instead it should look something like the target:
[screenshot: the target tensor, containing the severity values]

For my loss and optimizer, I do the following:

loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr= args.lr, weight_decay= args.weight_decay, momentum = args.momentum)

and for my training/test functions, I have the following:

def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.float().cuda(), target.float().cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            step = epoch * len(train_loader) + batch_idx
            log_scalar('train_loss', loss.item(), step)

def test(epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.float().cuda(), target.float().cuda()
            output = model(data)
            loss = loss_fn(output, target)
            test_loss += loss.item()*data.size(0)
            correct += output.eq(target.data).cpu().sum().item()
 
    test_loss /= len(test_loader.dataset)
    test_accuracy = 100.0 * correct / len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset), test_accuracy))
    step = (epoch + 1) * len(train_loader)
    log_scalar('test_loss', test_loss, step)
    log_scalar('test_accuracy', test_accuracy, step)

and then to call the training/test functions:

import tempfile

with mlflow.start_run() as run:
    # Log our parameters into mlflow
    for key, value in vars(args).items():
        mlflow.log_param(key, value)

    output_dir = tempfile.mkdtemp()
    print("Writing TensorFlow events locally to %s\n" % output_dir)
    writer = tf.summary.create_file_writer(output_dir)

    for epoch in range(1, args.epochs + 1):
        # Print out the active run
        print("Active Run ID: %s, Epoch: %s \n" % (run.info.run_uuid, epoch))
        train(epoch)
        test(epoch)

    print("Uploading TensorFlow events as a run artifact.")
    mlflow.log_artifacts(output_dir, artifact_path="events")

Any help as to where I’m going wrong and/or general guidance would be greatly appreciated!

Below are the hyperparameters:

'''
Training Configuration Parameters
'''
class Params(object):
    def __init__(self, batch_size, test_batch_size, epochs, lr, momentum, weight_decay, seed, cuda, log_interval):
        self.batch_size = batch_size
        self.test_batch_size = test_batch_size
        self.epochs = epochs
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        self.seed = seed
        self.cuda = cuda
        self.log_interval = log_interval
 
# Configure args
args = Params(batch_size=64,
              test_batch_size=64,
              epochs=200,
              lr=0.00001,
              momentum=0.8,
              weight_decay = 1e-6,
              seed=1,
              cuda=True,
              log_interval=200)

cuda = args.cuda and torch.cuda.is_available()
kwargs = {'num_workers': 1, 'pin_memory': True} if cuda else {}

The usage of a single output dimension with F.log_softmax(x, dim=1) and nn.MSELoss won’t work, for two reasons:

  • F.log_softmax(x, dim=1) on a tensor of shape [batch_size, 1] will always create an all-zero tensor, since each “row” would have a probability of 1 and thus a log probability of 0 (see the snippet after this list)
  • nn.MSELoss is an uncommon criterion for a classification use case; since you are using class labels, you might want to use nn.NLLLoss (keeping the F.log_softmax) or nn.CrossEntropyLoss (if you remove the F.log_softmax)
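
A quick way to see the first point (a minimal sketch; the batch size of 8 is just illustrative):

import torch
import torch.nn.functional as F

logits = torch.randn(8, 1)           # [batch_size, 1], as with a single output unit
print(F.log_softmax(logits, dim=1))  # all zeros: the softmax over a single element is 1.0

logits = torch.randn(8, 4)           # [batch_size, nb_classes]
print(F.log_softmax(logits, dim=1))  # proper log probabilities over the 4 classes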

Hey there @ptrblck! Thanks a ton for the insight!

I looked a little deeper into softmax and I think I understand what you’re describing in your first bullet point. Please correct me if I’m wrong, but from what I understood, softmax finds the probability of each output label relative to the rest, and these probabilities sum to 1 (i.e. 100%). So in my case, because there are four output labels, I would have a tensor per entry that looks something like [0.10, 0.35, 0.25, 0.30], where each value corresponds to the probability that that output label is the one we’re looking for, and together all 4 values add up to 1. With log_softmax taking the log of the softmax output of a single-element row, we’re essentially doing log(1), which is 0. If this is the right way to think about it, what would you recommend as my last activation function?

Regarding the loss function, I’ll go ahead and look into NLLLoss and CrossEntropyLoss! Is MSE typically only used for regression problems, since it doesn’t seem to be ideal for a classification use case?

------------------------------------------------------------------[Update]-----------------------------------------------------------

I went ahead and switched my loss function to use CrossEntropyLoss. Below are the code updates:

loss_fn = nn.CrossEntropyLoss()

def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.float().cuda(), target.float().cuda()
        optimizer.zero_grad()
        output = model(data)
#         pred = output.data.max(1)[1]
        target_val = torch.max(target, 1)[1]
        loss = loss_fn(output, target_val)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            step = epoch * len(train_loader) + batch_idx
            log_scalar('train_loss', loss.item(), step)

            fc1_weight = model.fc1.weight.detach().cpu().numpy()
            fc1_bias = model.fc1.bias.detach().cpu().numpy()
            fc2_weight = model.fc2.weight.detach().cpu().numpy()
            fc2_bias = model.fc2.bias.detach().cpu().numpy()
            fc3_weight = model.fc3.weight.detach().cpu().numpy()
            fc3_bias = model.fc3.bias.detach().cpu().numpy()
            fc4_weight = model.fc4.weight.detach().cpu().numpy()
            fc4_bias = model.fc4.bias.detach().cpu().numpy()

            tf.summary.histogram('weights/fc1/weight', fc1_weight, step)
            tf.summary.histogram('weights/fc1/bias', fc1_bias, step)
            tf.summary.histogram('weights/fc2/weight', fc2_weight, step)
            tf.summary.histogram('weights/fc2/bias', fc2_bias, step)
            tf.summary.histogram('weights/fc3/weight', fc3_weight, step)
            tf.summary.histogram('weights/fc3/bias', fc3_bias, step)
            tf.summary.histogram('weights/fc4/weight', fc4_weight, step)
            tf.summary.histogram('weights/fc4/bias', fc4_bias, step)

def test(epoch):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.float().cuda(), target.float().cuda()
            output = model(data)
            target_val = torch.max(target, 1)[1]
            loss = loss_fn(output, target_val)
            test_loss += loss.item()*data.size(0)
            pred = output.data.max(1)[1] # get the index of the max log-probability
#             correct += output.eq(target.data).cpu().sum().item()
#             correct += pred.eq(target.data).cpu().sum().item()
 
    test_loss /= len(test_loader.dataset)
    test_accuracy = 100.0 * correct / len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset), test_accuracy))
    step = (epoch + 1) * len(train_loader)
    log_scalar('test_loss', test_loss, step)
    log_scalar('test_accuracy', test_accuracy, step)

I also went ahead and updated my NN class to use softmax instead of log_softmax, and changed the model to use an output layer of size (32, 4) instead of (32, 1), since softmax returns the probabilities for each of the outputs, as mentioned above. My NN class updates are as follows:

class ANN(nn.Module):
    def __init__(self, input_dim = 11, output_dim = 4):
        super(ANN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 32)
        self.output_layer = nn.Linear(32,output_dim)
        self.dropout = nn.Dropout(0.15)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = F.relu(self.fc3(x))
        x = F.relu(self.fc4(x))
        x = self.output_layer(x)
        return F.softmax(x, dim=1)

My “model(data)” output now looks like the following, and the values don’t seem to be changing:

If I get the indices of the max values via the following lines of code:

        output = model(data)
        pred = output.data.max(1)[1]

The output shows the value doesn’t change:

and the target value calculation outputs the following:

Am I incorrectly using CrossEntropyLoss and Softmax?

No, since nn.CrossEntropyLoss expects raw logits instead of probabilities.
Remove the softmax activation and pass the output of the last linear layer in the shape [batch_size, nb_classes] to this criterion directly.
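
For reference, a minimal sketch of the expected usage (the shapes are illustrative):

import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

logits = torch.randn(8, 4)          # raw model output: [batch_size, nb_classes]
target = torch.randint(0, 4, (8,))  # class indices in [0, nb_classes-1], shape [batch_size]

loss = loss_fn(logits, target)      # log_softmax + NLLLoss are applied internally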

Hi there! Thanks again for the help. I’ve gone ahead and updated the NN class to remove the softmax and return the output of the last layer directly (shape (64, 4)). So I have the following:

class ANN(nn.Module):
    def __init__(self, input_dim = 11, output_dim = 4):
        super(ANN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 32)
        self.output_layer = nn.Linear(32,output_dim)
        self.dropout = nn.Dropout(0.15)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = F.relu(self.fc3(x))
        x = F.relu(self.fc4(x))
        x = self.output_layer(x)
        return x

This results in an output from “model(data)” like the following:

The shape of the output is as follows:
[screenshot: the output shape, [64, 4]]

I pass this directly to the loss function as you mentioned, like so:

loss_fn = nn.CrossEntropyLoss()

output = model(data)
target_val = torch.max(target, 1)[1]
loss = loss_fn(output, target_val)
loss.backward()

This is returning a loss value of 0 as well as accuracy of 0:

Am I incorrectly passing the raw output of the last layer to the loss function?

The model output looks correct and shows that your model is predicting class0 for all samples (as it has the highest logit).
Based on your previous post, all your target values are also 0, which would yield a small (or close to zero) loss.
I would suggest checking whether the target creation is correct and whether all zeros are expected: the target should have a shape of [batch_size] and contain class indices in the range [0, nb_classes-1], so [0, 3] in your case.
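
For example, a quick check along these lines (a sketch; target is one batch of labels):

import torch

values, counts = torch.unique(target, return_counts=True)
print(values)  # should contain class indices in the range [0, 3]
print(counts)  # number of samples per class in this batch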

Thanks again for your help @ptrblck! It’s very much appreciated, as this is my first time building an NN class.

I went ahead and followed up on your last comment and took a look at the target values. I realized that I was comparing my predictions to the index position of the target value, when I actually needed the raw target value. To resolve the issue, I did the following:

I updated my code from this:

target_val = torch.max(target, 1)[1]

to the following:

target_val = (torch.max(target, 1)[0]-1).type(torch.cuda.LongTensor) 

I realized that my previous target_val was taking the index position of the target value, which didn’t make sense, as the target tensor looked like the following:
[screenshot: target tensor of shape [batch_size, 1] holding the raw severity values]
So, as you can imagine, the index of the highest value in each row is always going to be 0, since there is only one value per row. Instead, I needed the target’s raw value (a value from 1-4), which I got by changing the second index of torch.max from [1] (indices) to [0] (values). I then subtracted 1 from the raw value to turn it into a class index (a value from 0-3). From what I understood from your previous comment, this target value now matches the model prediction, which is the index of the highest logit among the 4 output values.
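
As a side note, an equivalent and arguably simpler form (a sketch, assuming the target really has the shape [batch_size, 1] and already lives on the GPU) would be:

# Drop the singleton dimension and shift the 1-4 severity labels to 0-3 class indices
target_val = target.squeeze(1).long() - 1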

As a result of these changes, I get loss values like the following:

I think the change I made worked and it makes sense to me, but please correct me if I’m wrong. One thing I noticed, however, is that my accuracy hasn’t really been changing. I made sure to check that my model weights are actively changing, and they are, but only ever so slightly. I think the issue now is either in my hyperparameters or fundamentally in the way my NN class is set up. I’m currently using the following args:

args = Params(batch_size=512,
              test_batch_size=512,
              epochs=5,
              lr=0.00001,
              momentum=0.8,
              weight_decay = 1e-6,
              seed=1,
              cuda=True,
              log_interval=200)

and my test accuracy looks like the following over 5 epochs:

As you can see, the change in accuracy is extremely minimal. Do you have any recommendations as to what I could do to make the accuracy go up at a faster rate?

Below are my training and test losses:
Training Loss:

Test Loss:

Thanks again for all the help and I look forward to your response!

I think your general approach sounds right and you should indeed use the target values directly (in the range [0, nb_classes-1]).

The initial accuracy already seems to start at ~94% and increases a bit afterwards.
If so, I would guess that your dataset is heavily imbalanced and the model might be predicting the majority class only.
Could you check the number of samples for each class and see if one of them accounts for ~94% of all samples?
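
A quick pandas check would show it (a sketch; “df” is assumed to be the original dataframe):

# Relative frequency of each severity level
print(df["Severity"].value_counts(normalize=True))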

You were right once again! The dataset is heavily biased towards severity level two.
[screenshot: severity value counts, heavily dominated by level 2]

I’m assuming the best course of action would be to cut down the size of the initial dataframe by removing some of the severity level 2 entries (i.e. aim for a more even distribution of severity levels)? If so, is it better to go for an equal distribution (a straight 25% for each severity level), or to still maintain some bias between the various levels?

-----------------------------------------------------------------Update-------------------------------------------------------------

I went ahead and created a new dataframe that has 10,000 entries of each severity level (roughly as sketched below). I tried training the model on this dataset instead and have been stuck between 28% and 30% accuracy, regardless of the number of epochs and of adjusting the learning rate between 0.001 and 0.00001. Are 40,000 records sufficient to train a predictive model?
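
The undersampling was done roughly like this (a sketch; “df” is the original dataframe and the random_state is an assumption):

# Keep 10,000 randomly sampled rows of each severity level
balanced_df = (
    df.groupby("Severity", group_keys=False)
      .apply(lambda g: g.sample(n=10_000, random_state=1))
)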

It’s hard to tell how many samples are needed, but generally the number of samples should roughly scale with the number of parameters in your model.
You could use a WeightedRandomSampler to balance the batches, but since your imbalance is quite severe I wouldn’t know what the best expectation would be.
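
A minimal sketch of that approach (assuming y_train is a LongTensor of class indices in [0, 3] and train is the TensorDataset from earlier):

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Weight each sample by the inverse frequency of its class
class_counts = torch.bincount(y_train)
sample_weights = 1.0 / class_counts[y_train].float()

sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights), replacement=True)
train_loader = DataLoader(train, batch_size=args.batch_size, sampler=sampler)  # sampler replaces shuffle=True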

No problem at all. Thanks again for the information! I’ll do some investigating myself to see if I can find a similar dataset that isn’t as biased, or try to utilize some of the other attributes in this current dataset. Once again, I appreciate all the help!