Simple Binary Classification NN doesn't converge?

Hi everyone. I’m building a simple model for a binary classification task on the German Credit Numeric dataset.
I have a numpy array of inputs (24 features), followed by a numpy array of outputs {0,1} as follows:


X[0]: [ 3  6  4 13  2  5  1  4  3 28  3  2  2  2  1  1  0  1  0  0  1  0  0  1]
X shape: (750, 24)


y[0]: 1
Y shape: (750,)

My model is a simple fully connected network with ReLU activations on the hidden layers:

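import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumption: `device` was defined like this earlier in the script
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class LinearModel(nn.Module):
    def __init__(self):
        super(LinearModel, self).__init__()
        # 24 features in, one logit out
        self.fc1 = nn.Linear(24, 32, device=device)
        self.fc2 = nn.Linear(32, 16, device=device)
        self.fc3 = nn.Linear(16, 1, device=device)
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x[0] # For a single 1-D sample: return the scalar logit, not a one-element tensor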

I decided to use BCEWithLogitsLoss(), since I did not put the sigmoid function on the last layer of my model, along with the Adam() optimizer:

loss = nn.BCEWithLogitsLoss()
cnn = LinearModel()
opt = torch.optim.Adam(cnn.parameters(), lr = 1e-3)
epochs = 100
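As a quick sanity check on that choice, here is a minimal sketch showing that BCEWithLogitsLoss on raw outputs gives the same value as applying a sigmoid first and then using BCELoss. The logits and targets are made-up values for illustration only, not data from my dataset.

```python
import torch
import torch.nn as nn

# Hypothetical raw model outputs (logits) and their 0/1 targets.
logits = torch.tensor([-2.0, 0.0, 3.0])
targets = torch.tensor([0.0, 1.0, 1.0])

# Loss computed directly on the logits (the sigmoid is applied internally).
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Same quantity in two steps: explicit sigmoid, then plain binary cross-entropy.
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), targets)

print(loss_with_logits.item(), loss_two_step.item())
```

The two numbers should agree up to floating-point error, which is why the model itself does not need a sigmoid on its last layer.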

My training loop is as follows:

num_correct_train = 0
num_samples_train = 0
num_correct_val = 0
num_samples_val = 0

valAccuracies = []
trainAccuracies = []

trainLosses = []
valLosses = []

avgTrainLosses = []
avgValLosses = []

for epoch in range(1, epochs+1):
  # We put the network in training mode (otherwise the eval() call below persists into later epochs)
  cnn.train()

  # We iterate over the train set, taking batches of xb, yb.
  for xb, yb in zip(Xtrain, ytrain):
      xb = torch.from_numpy(xb).float()
      yb = torch.tensor(yb).float()
      xb, yb = xb.to(device), yb.to(device) # Move them to the GPU
      opt.zero_grad() # We empty the gradients

      ypred = cnn(xb) # Actual prediction of our model
      lTrain = loss(ypred, yb) # Binary cross-entropy loss (with logits) between our prediction and the real value
      trainLosses.append(lTrain.item()) # Recorded so that the average printed below is defined
      lTrain.backward() # We compute the gradients
      opt.step() # Parameters updated -> Single step optimization 
      # We round the prediction and:
      # if == ground truth -> Correct prediction
      # if != ground truth -> Wrong prediction
      ypred_tag = torch.round(torch.sigmoid(ypred))

      if(ypred_tag == yb):
        num_correct_train += 1
      num_samples_train += 1

  with torch.no_grad():
    cnn.eval() # We put our model in evaluation mode for testing the accuracy over the test set
    for xb, yb in zip(Xtest, ytest):

      xb = torch.from_numpy(xb).float()
      yb = torch.tensor(yb).float()
      xb, yb = xb.to(device), yb.to(device) # Move them to the GPU
      ypred = cnn(xb)      
      lVal = loss(ypred, yb) # Binary cross-entropy loss (with logits) between our prediction and the real value
      valLosses.append(lVal.item()) # Recorded so that the average printed below is defined
      ypred_tag = torch.round(torch.sigmoid(ypred))
      if(ypred_tag == yb):
        num_correct_val += 1
      num_samples_val += 1


  train_acc = float(num_correct_train)/float(num_samples_train)
  val_acc = float(num_correct_val)/float(num_samples_val)
  if (epoch % 5 == 0):
    print(f'Epoch {epoch+0:03}:\nTrain Loss: {np.average(trainLosses):.5f} | Train Accuracy: {train_acc:.3f} /----/ Val Loss: {np.average(valLosses):.5f} | Val Accuracy: {val_acc:.3f}')

But after training the network for about 30-40 epochs, the validation loss and validation accuracy flatten out and stay around 0.55 (loss) and 0.73 (accuracy).

As for the training loss and accuracy, after 30-40 epochs they only move from 0.47 to 0.45 (loss) and from 0.77 to 0.78 (accuracy), which is not converging very fast either.

I don’t get why it doesn’t converge faster during the first epochs.

It feels like I could train this network for hours and still get an accuracy oscillating around 0.73.

Am I doing something wrong?


Hi Slim!

If I understand correctly what you are doing, you have a training dataset
that consists of 750 individual samples and you are using a batch size of
one sample – that is, you take optimization steps based on the gradients
from single individual samples.

It’s conceivable that the single-sample gradients are simply too noisy for
the optimization to make any systematic progress (without using a very
small learning rate).

Try training with significantly larger batches. For example, you could use
a batch size of 15, so that an epoch would consist of 50 optimization steps
(each based on 15 individual samples). Or you could try a batch size of
75. Or maybe even a batch size of 750 where an epoch would consist of
only one optimization step, but using gradients averaged over your entire
training dataset.
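A minimal sketch of one epoch of mini-batch training, assuming random stand-in data with your shapes and an `nn.Sequential` stand-in for your model (neither is your actual code). With a batch size of 15, the loop takes 50 optimization steps per epoch:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-in data with the shapes from the question (750 samples, 24 features).
X = torch.rand(750, 24)
y = torch.randint(0, 2, (750,)).float()

loader = DataLoader(TensorDataset(X, y), batch_size=15, shuffle=True)

# Stand-in for the model in the question (same layer sizes).
model = nn.Sequential(
    nn.Linear(24, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1),
)
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

steps = 0
for xb, yb in loader:
    opt.zero_grad()
    logits = model(xb).squeeze(1)     # (15, 1) -> (15,), matching yb
    batch_loss = loss_fn(logits, yb)  # averaged over the 15 samples in the batch
    batch_loss.backward()
    opt.step()
    steps += 1

print(steps)  # 750 / 15 = 50 optimization steps per epoch
```

Note that with batches the model has to return one logit per sample, so the `x[0]` indexing in your `forward()` would need to go.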

Whether you will get good validation results will depend on the character
of your data. But I would certainly expect you to be able to overfit your
training dataset so as to get a low loss and a high accuracy.

Note, your model is not particularly complicated, but it still consists of a
reasonably large number of parameters, so you may need significantly
more than 40 epochs to train / overfit effectively.

Also, there is some lore that training can get stuck when you use ReLU,
as its gradients can vanish. If the input to ReLU is negative, the gradient
is exactly zero, and the optimizer has no idea which way to move to get
out of the zero-gradient region. So you might try some other activation,
such as LeakyReLU, that is designed to avoid this problem.
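To illustrate the zero-gradient point, here is a small sketch that compares the gradient of ReLU and LeakyReLU at a negative input. The input value -1.0 and the slope 0.01 (PyTorch's default for `leaky_relu`) are just illustrative choices.

```python
import torch
import torch.nn.functional as F

# Gradient of ReLU at a negative input.
x = torch.tensor(-1.0, requires_grad=True)
F.relu(x).backward()
relu_grad = x.grad.item()

# Gradient of LeakyReLU at the same input.
x2 = torch.tensor(-1.0, requires_grad=True)
F.leaky_relu(x2, negative_slope=0.01).backward()
leaky_grad = x2.grad.item()

print(relu_grad, leaky_grad)
```

ReLU gives a gradient of exactly zero here, while LeakyReLU gives a small nonzero gradient equal to its negative slope.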


K. Frank

Hello Frank and thank you so much for answering.

You got me thinking about the batch size (honestly, I had considered it myself).

I did convert my numpy arrays to two dataloaders as follows:

TrainTensor = TensorDataset(torch.Tensor(Xtrain),torch.Tensor(ytrain))
TrainDataLoader = DataLoader(TrainTensor, batch_size=750, shuffle=True)

TestTensor = TensorDataset(torch.Tensor(Xtest),torch.Tensor(ytest))
TestDataLoader = DataLoader(TestTensor, batch_size=750, shuffle=False)

and my training loop now takes 250 samples of X (with 24 features) and 250 labels, since I’m iterating like this:
for xb, yb in TrainDataLoader:

and I’m checking for the accuracy as follows:

ypred_tag = torch.round(torch.sigmoid(ypred))
num_correct_train = (ypred_tag == yb).sum().float()
train_acc = torch.round(num_correct_train/yb.shape[0] * 100)

where ypred is the output of my model (250x1) and train_acc is the ratio of correct predictions to the total number of predictions, as a percentage. I do the same for the test set.
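One thing worth double-checking in a computation like this is the shapes. The sketch below uses random stand-in tensors, and assumes the labels have shape (250,) while the predictions have shape (250, 1). In that case the `==` comparison broadcasts to a 250x250 matrix, so the sum no longer counts one comparison per sample.

```python
import torch

torch.manual_seed(0)
ypred = torch.randn(250, 1)                # stand-in logits, shape (250, 1)
yb = torch.randint(0, 2, (250,)).float()   # stand-in labels, shape (250,)

ypred_tag = torch.round(torch.sigmoid(ypred))

# Comparing (250, 1) with (250,) broadcasts to a (250, 250) matrix.
broadcast_shape = (ypred_tag == yb).shape

# Dropping the trailing dimension first gives one comparison per sample.
correct = (ypred_tag.squeeze(1) == yb).sum().item()
accuracy = correct / yb.shape[0]

print(broadcast_shape, accuracy)
```

If the labels are already stored with shape (250, 1), the original comparison is fine as written.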

I also replaced ReLU with LeakyReLU on all the hidden layers of my model.

I ran 300 epochs and the final result is as follows:

Epoch 300: Train Loss: 0.40169 | Train Accuracy: 81.000 /----/ Val Loss: 24.78004 | Val Accuracy: 35.000

I really don’t know why my validation loss is so high; I haven’t changed much apart from wrapping my numpy arrays in dataloaders.

Please note that the train/val accuracy and loss shown are the values for that specific epoch (in this case 300).

Overall it started with these values (I’m printing every 50 epochs):

Epoch 50: Train Loss: 0.59462 | Train Accuracy: 71.000 /----/ Val Loss: 2.69253 | Val Accuracy: 68.000

I also ran it for 1000 epochs: the training accuracy reached 97% with a loss of 0.1, while validation went the opposite way (32% accuracy and a loss of 280, which is absurd).

Hi Slim!

You are most likely overfitting your training data. This means that your
network is, in effect, “memorizing” your individual training samples without
actually learning to perform your general classification task.

The good news is that you are successfully training your network (to overfit).
The bad news is that your network won’t successfully classify data samples
that are not in your training set.

(You should always be open to the possibility that you simply have a bug
somewhere, so check your code.)

The best solution to overfitting is to train with (much) more training data
(if you have it).

A related approach is to augment your training data (although I don’t
know off-hand whether there is a sensible way to augment your “Credit
Numeric” data).

You might also consider adding Dropout layers to your network.
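As a rough sketch of what that could look like (the class name `DropoutModel` is hypothetical, and the dropout probability of 0.5 is just `nn.Dropout`'s default, not a tuned value):

```python
import torch
import torch.nn as nn

class DropoutModel(nn.Module):  # hypothetical name, same layer sizes as your model
    def __init__(self, p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(24, 32), nn.LeakyReLU(), nn.Dropout(p),
            nn.Linear(32, 16), nn.LeakyReLU(), nn.Dropout(p),
            nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)  # one logit per sample

model = DropoutModel()
x = torch.rand(8, 24)  # stand-in batch of 8 samples

model.eval()  # dropout disabled: repeated calls give identical outputs
with torch.no_grad():
    eval_same = torch.equal(model(x), model(x))

model.train()  # dropout active: units are randomly zeroed on each call
out_train = model(x)

print(eval_same, out_train.shape)
```

Dropout only acts in training mode, so remember to call `model.train()` before each training epoch and `model.eval()` before validation.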

One last question: Is your validation data of the same character as your
training data? Does it come from the same source? For example, do
you have a larger dataset that you randomly split into your training and
validation datasets so that they are statistically identical?

If your training and validation datasets are rather different in character
it becomes harder to train your model to perform well on your validation
data – if you can do so at all – and is likely to require more training data
in order to do so.


K. Frank

I just checked my code and there was indeed one bug: I scaled my train data but not my test data, so the features in the test data were on a different scale from those in the train data. With that said:

I’m thinking hard about how to augment my data, apart from scaling feature values between 0 and 1.

The German Credit dataset was given to me by my professor and can be found here, where the first 23 columns represent a potential client (Salary, Job, Age, etc.), while the last column represents the label: 0 (Bad) or 1 (Good).
This dataset is composed of 1000 samples and does not come with a test set of its own, so the professor split it 75% train and 25% test (750 samples for training and 250 samples for testing).

After adding dropout with probability 0.5 after each LeakyReLU, my model now performs as follows:

Epoch 1000:
	- Training accuracy: 90.000 (avg. 82.341)
	- Training loss: 0.268 (avg. 0.386)
	- Validation accuracy: 71.000 (avg. 70.331)
	- Validation loss: 0.768 (avg. 0.641)

I think my model is still drawn to the training data and still overfits, as the validation loss keeps rising.

I’ll leave some screenshots below.

Anyway, thank you for your kind help, I really appreciate it! :)


Hi Slim!

The scaling bug could be a significant issue. Scaling your training and
validation data differently could certainly degrade inference on your
validation data or cause it not to work at all.
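A minimal sketch of consistent min-max scaling, assuming random stand-in arrays with your shapes: the minimum and range are computed on the training split only, and the same transform is then applied to both splits.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in arrays with the shapes from this thread (not the real dataset).
Xtrain = rng.integers(0, 30, size=(750, 24)).astype(np.float32)
Xtest = rng.integers(0, 30, size=(250, 24)).astype(np.float32)

# Scaling statistics come from the training split only.
col_min = Xtrain.min(axis=0)
col_range = Xtrain.max(axis=0) - col_min
col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns

Xtrain_scaled = (Xtrain - col_min) / col_range
# Same transform for the test split; values may fall slightly outside [0, 1].
Xtest_scaled = (Xtest - col_min) / col_range

print(Xtrain_scaled.min(), Xtrain_scaled.max())
```

The key point is that the test split never gets its own statistics, so a given raw feature value maps to the same scaled value in both splits.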

Input values that are naturally continuous (such as Salary and Age) could
reasonably have some “noise” added to them. So, for example, a sample
for which Salary = €100,000 could be modified to make an augmented
sample by changing the Salary to €107,000.

The idea is that the model can’t simply “memorize” that the one sample
that happens to have Salary exactly equal to €100,000 is “bad”. It has
to learn that a sample with such-and-such a Job and such-and-such an
Age with a Salary that is about €100,000 is likely to be bad.

(It’s not so clear to me that you could sensibly augment your data by
modifying a discrete label such as Job.)
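Here is one way such noise could be sketched. The helper name `add_noise`, the choice of which columns count as continuous, and the 5% relative noise level are all assumptions for illustration, and the tiny array is made-up data.

```python
import numpy as np

def add_noise(X, continuous_cols, rel_std=0.05, seed=0):
    """Return a copy of X with multiplicative Gaussian noise on the selected columns."""
    rng = np.random.default_rng(seed)
    X_aug = X.astype(np.float64)  # astype returns a copy, so X itself is untouched
    noise = rng.normal(1.0, rel_std, size=(X.shape[0], len(continuous_cols)))
    X_aug[:, continuous_cols] *= noise
    return X_aug

# Made-up samples: columns are (Salary, Age, some discrete flag).
X = np.array([[100000.0, 35.0, 1.0],
              [50000.0, 52.0, 0.0]])
X_aug = add_noise(X, continuous_cols=[0, 1])

print(X_aug)
```

The discrete column is left unchanged, and each augmented sample keeps the label of the sample it was derived from.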

Your new results look like significant progress. It does look like your
training starts to overfit somewhere around epoch 150, but, still, you’ve
brought your validation loss down significantly and your accuracy up.
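If you log the validation loss once per epoch, a small helper like the following can locate where it bottoms out. The function name `best_epoch` is hypothetical and the loss values are made up for illustration, not taken from your run.

```python
def best_epoch(val_losses):
    """Return (1-based epoch, loss) at which the validation loss is lowest."""
    idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return idx + 1, val_losses[idx]

# Made-up per-epoch validation losses: falling at first, then rising again.
history = [0.70, 0.62, 0.58, 0.60, 0.66]
print(best_epoch(history))  # (3, 0.58)
```

Epochs after that minimum are where the model keeps improving on the training data while getting worse on the validation data.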

It might be worth trying some experiments to see whether this improvement
is due to consistent scaling of your training and validation data or whether
it’s due to the addition of Dropouts (or maybe both).

Last, you can make your model less likely to overfit by making it smaller,
either “narrower” or “shallower” or both. Your model is already quite simple,
so it’s not clear how much room for improvement you would have here, but
you could try reducing your internal numbers of features from 32 and 16 to,
say, 16 and 8. (Of course reducing the size of your model could also make
it so small that it can’t effectively learn your actual classification problem.)
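To put rough numbers on that, here is a short sketch that counts weights and biases for a stack of fully connected layers (the helper name `count_params` is made up for this example):

```python
def count_params(layer_sizes):
    """Weights plus biases of a fully connected stack with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

original = count_params([24, 32, 16, 1])  # 800 + 528 + 17 = 1345
smaller = count_params([24, 16, 8, 1])    # 400 + 136 + 9 = 545

print(original, smaller)
```

So halving the hidden widths cuts the parameter count by more than half, while your training set stays at 750 samples.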


K. Frank

Hi Frank!
I changed my layer sizes to 24->12->6->1, and this brought my validation loss down significantly (less overfitting).
It looks almost stable now. I played with dropout as well: a higher value of p brings more stability to the validation loss curve (even though I could still see overfitting starting around epoch 400/500).

I did not try to augment my data further; I left it as it was originally, plus scaling values between 0 and 1.

These are my overall final results:

And I think this really is my final result, since the official documentation of the German Credit dataset shows (in the scheme at the bottom) that a neural network reaches an average accuracy of 64% and a maximum of 70% on this dataset. It performs better with classical ML algorithms.

With the plots above I did reach this result:

Epoch 1000:
	- Training accuracy: 78.000 (avg. 69.438)
	- Training loss: 0.470 (avg. 0.577)
	- Validation accuracy: 74.000 (avg. 69.822)
	- Validation loss: 0.577 (avg. 0.599)

which seems really consistent and stable.

Thank you for your time, Frank, you really helped me out a lot. I hope we can connect on some other tasks! :)