Neural Network learning to predict only one class with binary cross entropy function

Hello, I have a very basic problem with training a classification MLP network. I'm trying to train a network for a simple classification task on a randomly generated dataset with somewhat imbalanced classes (59 observations of class 0 and 140 of class 1), and I can't seem to teach the NN to distinguish between them: it always simply predicts class 1 for every observation. I tried using different class weightings, but that didn't help. I know that the dataset is small, but unfortunately that was the task given by the teacher, and I shouldn't make the dataset bigger. Here is my code:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

class NeuralNet(nn.Module):
  def __init__(self, n_input, n_hidden1, n_hidden2, output_shape):
    super().__init__()
    self.hidden2 = nn.Linear(n_hidden1, n_hidden2)
    self.hidden = nn.Linear(n_input, n_hidden1)
    self.outer = nn.Linear(n_hidden1, output_shape)
    self.sigmoid = nn.Sigmoid()
    self.softmax = nn.Softmax(dim=1)
    
  def forward(self, data):
    step1 = self.hidden(data)
    step2 = self.sigmoid(step1)
    step3 = self.hidden2(step2)
    step4 = self.sigmoid(step3)
    step5 = self.outer(step4)
    step6 = self.softmax(step5)
    return step6

model = NeuralNet(4, 5, 5, 2)
model = model.float()
num_epoch = 100
learning_rate = 0.001
loss_function = F.binary_cross_entropy
optimizer = optim.SGD(model.parameters(), lr=learning_rate)#, weight_decay=1e-4, momentum=0.9)

#training loop:
train_scores = []
test_scores = []
for epoch in range(num_epoch):
  scores = []
  model.train()
  for x_batch, y_batch in train_dataloader:
    y_pred = model(x_batch).squeeze()
    y_batch = y_batch.squeeze().float()
    
    loss = loss_function(y_pred, y_batch, weight=torch.tensor([0.6, 0.4]))
    loss.backward()
    
    optimizer.step()
    optimizer.zero_grad()
    scores.append(loss.item())
    
  model.eval()
  train_loss = np.mean(scores)
  with torch.no_grad():
        test_loss = sum(loss_function(model(xb).squeeze(), yb.squeeze().float()) for xb, yb in test_dataloader)
  test_loss = test_loss/len(test_dataloader)
  #print(f"Epoch {epoch} train loss: {train_loss}, test loss: {test_loss}")
  train_scores.append(train_loss)
  test_scores.append(test_loss)

Hello Ra!

You're passing the wrong information / shape to binary_cross_entropy.

binary_cross_entropy expects one prediction value per sample, to
be understood as the probability of that sample being in class “1”.
(It expects a single target value per sample, as well.) You construct
your last linear layer to have two outputs – you should have one.

When you switch to a single output, you will need to switch from
Softmax to Sigmoid in your final layer (after the last linear layer).
However, for numerical reasons, you will be better off using
binary_cross_entropy_with_logits, fed directly by your last
linear layer, and not have a Sigmoid.
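
For illustration, here is a minimal sketch of that setup (the class name is just illustrative, the layer sizes follow your code, and the dummy batch is only there for shape-checking; it's one way to wire things up, not the only one):

import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralNetOneOutput(nn.Module):
  def __init__(self, n_input, n_hidden1, n_hidden2):
    super().__init__()
    self.hidden = nn.Linear(n_input, n_hidden1)
    self.hidden2 = nn.Linear(n_hidden1, n_hidden2)
    self.outer = nn.Linear(n_hidden2, 1)  # single output, no Softmax / Sigmoid after it
    self.sigmoid = nn.Sigmoid()

  def forward(self, data):
    x = self.sigmoid(self.hidden(data))
    x = self.sigmoid(self.hidden2(x))
    return self.outer(x)  # raw logit, fed directly to the loss

model = NeuralNetOneOutput(4, 5, 5)
x_batch = torch.randn(8, 4)                  # dummy batch: 8 samples, 4 features
y_batch = torch.randint(0, 2, (8,)).float()  # 0-or-1 targets, as floats
logits = model(x_batch).squeeze(1)           # shape (8,)
loss = F.binary_cross_entropy_with_logits(logits, y_batch)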

In order to reweight your unbalanced training data (You might
not need to.) you will need to use the per-sample weight
argument to binary_cross_entropy_with_logits. That is,
if a specific sample in your batch is class “0” you give it the
reweighting weight you want for class “0”; if it’s class “1”, you
give it the class “1” weight.
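
A minimal sketch of building such per-sample weights inside your training loop (the 0.6 / 0.4 values are only placeholders, not a recommendation, and logits is the raw output of your last linear layer):

# inside the training loop, with y_batch holding 0.0 / 1.0 targets
w_class0, w_class1 = 0.6, 0.4  # placeholder reweighting factors
weights = torch.where(y_batch == 0,
                      torch.full_like(y_batch, w_class0),
                      torch.full_like(y_batch, w_class1))
loss = F.binary_cross_entropy_with_logits(logits, y_batch, weight=weights)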

One other note: It doesn’t matter in your specific case because
you have n_hidden1 = n_hidden2 (= 5), but you should have
n_hidden2 inputs to your last linear layer:
self.outer = nn.Linear(n_hidden2, output_shape)

(You can treat this as a multiclass classification problem – with just
two classes – and have an output layer with two outputs, but then
you would need to use cross_entropy. I would expect it to be
modestly more efficient to train a network that has one output with
binary_cross_entropy.)
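
For completeness, a minimal sketch of that two-output variant, assuming the last linear layer keeps its two outputs and the Softmax is removed so the model returns raw scores:

logits = model(x_batch)                  # shape (batch, 2), raw scores, no Softmax
targets = y_batch.squeeze().long()       # 0-or-1 integer class labels
loss = F.cross_entropy(logits, targets)  # applies log_softmax internally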

Good luck!

K. Frank

Thanks for this comprehensive answer, it explains a lot! 🙂

It's me again. I've made the suggested corrections, but unfortunately the network still doesn't learn anything, and now it classifies all observations as class 0, which results in even worse accuracy. I'm using binary_cross_entropy_with_logits as the loss function, with a single output from the last linear layer fed directly into the loss function, without passing it through a sigmoid.

Hello Ra!

You didn’t mention it, but did you also change the y_batch you pass
to your binary_cross_entropy_with_logits loss_function?
The individual target samples in your y_batch should now be single
class labels that are either 0 or 1. (They could be probabilities
between 0 and 1, but I’m guessing you’re not doing that.)

Other than that, things look right. I don’t see anything wrong with
your code (but I miss things all the time). In any event, as a general
debugging practice, you should go over your code with a fine-tooth
comb, and maybe print out some intermediate results here and there
to make sure things look like what you expect.
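
For example, you could temporarily add something like this inside your training loop, right after computing y_pred:

print('shapes:', x_batch.shape, y_pred.shape, y_batch.shape)
print('y_pred:', y_pred.detach())
print('y_batch unique values:', y_batch.unique())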

About the data, am I correct that a single input sample (an element
of x_batch) has four “features,” that is, consists of four numbers?
And that a single target sample (an element of y_batch) consists
of a single 0-or-1 class label?

If you look at the data manually, can you classify it? Is the problem
clearly doable, or does it look hard or impossible? You mention a
“randomly generated dataset.” If it’s really random (depending what
you mean by that), maybe there isn’t really any link between the
samples (x_batch) and the labels (y_batch), so you’d be trying
to build a classifier for random noise.

What happens if you make your own data in the same format as
the data you’re working with, but obviously and easily classifiable?
(For example, you could generate your four “features” randomly,
and set class = 1 when sum (features) > 0, and class = 0,
otherwise.) Does your network train successfully on trivial data?
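
A minimal sketch of generating such trivially classifiable data (the sample count and batch size are arbitrary):

import torch
from torch.utils.data import TensorDataset, DataLoader

n_samples = 200
X = torch.randn(n_samples, 4)   # four random features per sample
y = (X.sum(dim=1) > 0).float()  # class 1 when the feature sum is positive
toy_loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)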

What is your batch size? Have you tried running with a smaller and
larger learning_rate? num_epoch = 100 seems like it should be
enough to at least see something for a problem of this size, but have
you tried running for a lot more epochs?

Good luck!

K. Frank

Thanks again for responding so quickly! Yes, my data consist of 4 randomly created variables made in Excel with the functions int() and rand() (the full Excel formula is =int(rand()*8)+1 ), so their values range from 1 to 8; I also tried normalizing them, which didn't help either. The response variable was then created with a simple decision-tree-like structure -> if X[0] > 1 and X[1] >= 2 and X[2] >= 3 and X[3] >= 4 then 0 else 2 (the full Excel formula is =IF(A4>1;IF(B4>=2;IF(C4>=3;IF(D4>=4;1;2);2);2);2) ), so it should be separable. I trained a decision tree classifier on it that achieved 100% accuracy, but of course that was much easier for it, since this function actually is a decision tree. I am passing a single observation in each batch, with 4 independent variables and the response variable, so the DataLoader is probably a bit of an overkill here, but I wrote it for educational purposes. The full code for data loading looks like this:

import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader

df_raw = pd.read_excel("data.xlsx").rename(columns={'PG':"Preference Group"})
df_raw['Preference Group'] = df_raw['Preference Group'] - 1

X_numpy = df_raw.drop("Preference Group", axis=1).astype('float64').values
X_numpy = X_numpy - np.mean(X_numpy, axis=0)
X_numpy = X_numpy - np.std(X_numpy, axis=0)
y_numpy = df_raw['Preference Group'].astype('float64').values
X_train_np, X_test_np, y_train_np, y_test_np =  train_test_split(X_numpy, y_numpy, test_size=0.3, stratify=y_numpy, random_state=42)

X_train = torch.tensor(X_train_np, dtype=torch.long)
X_test = torch.tensor(X_test_np, dtype=torch.long)
y_train = torch.tensor(y_train_np, dtype=torch.long)
y_test = torch.tensor(y_test_np, dtype=torch.long)
#option with one hot encoding and 2 dimensional output of neural network
#encoder = OneHotEncoder(categories='auto')
#encoder.fit(y_train_np.reshape(-1, 1))
#y_train = torch.tensor(encoder.transform(y_train_np.reshape(-1, 1)).todense(), dtype=torch.float)
#y_test = torch.tensor(encoder.transform(y_test_np.reshape(-1, 1)).todense(), dtype=torch.float)

train_dataset = TensorDataset(X_train.float(), y_train)
test_dataset = TensorDataset(X_test.float(), y_test)
train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=False)

I will try all the other suggestions you proposed, so maybe they will help me find out more about why this happens, and whether it's really about the data I used rather than the model.
All the best!
Rafał

Hi Rafal!

For binary_cross_entropy, your response variable (what is often
called the target) should be a probability in the range [0, 1]. (If we
use just the values 0 and 1, we can understand the target to be the
class label for classes “0” and “1”.)

You say your target is 0 or 2 (but your Excel formula looks like 1 or
2). Either way, the value 2 falls outside the required range of [0, 1],
so the loss function won't work properly. If your target really does
take on the value 2, this is likely your problem.
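
One quick check, using the tensors from your data-loading code:

print(y_train.unique())  # should print only 0 and 1 for a binary cross-entropy target
print(y_test.unique())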

Best.

K. Frank

Hey, thanks again for helping for so long! My formula gives a class value of 1 or 2 because that was the code provided by the lecturer, but if you look at my code, I have a line where I change it to 0 or 1:

df_raw['Preference Group'] = df_raw['Preference Group'] - 1

The 0 or 2 is a typo in my comment, sorry for that.