Simple pattern is not detected by simple linear model

I have a very simple example: 4 training samples with 4 labels. I am trying to fit a simple logistic regression to it, hoping it will overfit to zero loss, but the model does not converge.

Here is the rule I am hoping the model will learn:

  • When input is [1., 0., 0., 1.], the label is 1
  • When input is [0., 1., 1., 0.], the label is 1
  • When input is [1., 0., 1., 0.], the label is 0
  • When input is [0., 1., 0., 1.], the label is 0

Is this pattern something a non-linear model cannot capture? Any idea what I am doing wrong?

import torch
import torch.nn as nn


training_samples = torch.tensor([[0., 1., 1., 0.],
                                 [1., 0., 0., 1.],
                                 [0., 1., 0., 1.],
                                 [1., 0., 1., 0.]])

labels = torch.tensor([1., 1., 0., 0.]).view(-1, 1)


def normalize(x):
    # per-feature standardization (zero mean, unit variance over the batch)
    return (x - x.mean(0, keepdim=True)) / x.std(0, keepdim=True)



class LogisticRegression(torch.nn.Module):
    # despite the name, this is a small MLP: two ReLU hidden layers plus a sigmoid output
    def __init__(self):
        super(LogisticRegression, self).__init__()
        self.linear1 = torch.nn.Linear(4, 2)
        self.linear2 = torch.nn.Linear(2, 2)
        self.linear3 = torch.nn.Linear(2, 1)
        
        
    def forward(self, x):
        x = torch.relu(self.linear1(x))
        x = torch.relu(self.linear2(x))
        x = torch.sigmoid(self.linear3(x))
        return x
    


model = LogisticRegression()
criterion = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=.01)


model.train()
for epoch in range(10):
    optimizer.zero_grad()
    shuffle = torch.randperm(len(labels))  # shuffling the training data 
    labels = labels[shuffle] # shuffle labels
    training_samples = training_samples[shuffle]# shuffle features 
    y_pred = model(normalize(training_samples))    # normalization does not help
    loss = criterion(y_pred, labels)   
    loss.backward()
    optimizer.step()
    print('Epoch %d | Loss: %.4f' % (epoch, loss.item()))

Hello pemfir!

The short answer is that a variant of your neural network can capture
this pattern if you add a third “hidden neuron” to your first layer.

That is, change:

    def __init__(self):
        super(LogisticRegression, self).__init__()
        self.linear1 = torch.nn.Linear(4, 2)
        self.linear2 = torch.nn.Linear(2, 2)
        self.linear3 = torch.nn.Linear(2, 1)

to something like this:

    def __init__(self):
        super(LogisticRegression, self).__init__()
        self.linear1 = torch.nn.Linear(4, 3) 
        self.linear2 = torch.nn.Linear(3, 2) 
        self.linear3 = torch.nn.Linear(2, 1)
...

Note that to train this “three-neuron” network down to a low loss, I
increased the learning rate to 0.1 and the number of epochs to 1000.
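
For concreteness, here is a minimal sketch of that change applied to your original four-input network, written for current PyTorch (the seed is an arbitrary assumption, and whether a given run converges still depends on the random initialization):

import torch

torch.manual_seed(0)  # assumption: an arbitrary fixed seed

training_samples = torch.tensor([[0., 1., 1., 0.],
                                 [1., 0., 0., 1.],
                                 [0., 1., 0., 1.],
                                 [1., 0., 1., 0.]])
labels = torch.tensor([1., 1., 0., 0.]).view(-1, 1)

model = torch.nn.Sequential(
    torch.nn.Linear(4, 3),   # third hidden neuron added here
    torch.nn.ReLU(),
    torch.nn.Linear(3, 2),
    torch.nn.ReLU(),
    torch.nn.Linear(2, 1),
    torch.nn.Sigmoid(),
)
criterion = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # lr raised from 0.01

for epoch in range(1000):  # epochs raised from 10
    optimizer.zero_grad()
    loss = criterion(model(training_samples), labels)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 100 == 0:
        print('Epoch %d | Loss: %.4f' % (epoch, loss.item()))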

Some further comments:

I don’t have a good explanation as to why adding this neuron is
necessary / sufficient, or how to analyze what other minimal tweaks
would work.

Upon inspection, your samples contain redundant information:

s[i, 1] == s[i, 2] if and only if s[i, 0] == s[i, 3], and your labels
are 1 exactly when this equality holds.

We can write a simple (quadratic) formula for your labels (it is the
XNOR of s[i, 1] and s[i, 2]):

l[i] = s[i, 1] * s[i, 2] + (1 - s[i, 1]) * (1 - s[i, 2])

Neural networks can certainly reproduce (good approximations to)
quadratic functions, but I don’t have a good understanding of how
minimal a neural network can be and still do this.
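
As a quick sanity check, evaluating this formula on your four samples reproduces the labels exactly (any recent PyTorch):

import torch

s = torch.tensor([[0., 1., 1., 0.],
                  [1., 0., 0., 1.],
                  [0., 1., 0., 1.],
                  [1., 0., 1., 0.]])

# XNOR of columns 1 and 2
l = s[:, 1] * s[:, 2] + (1. - s[:, 1]) * (1. - s[:, 2])
print(l)  # tensor([1., 1., 0., 0.]) -- matches the labels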

It’s interesting to note that if we “help” the network by adding a single
product feature to the input samples, it can be trained successfully with
only two hidden neurons in the first layer.

So, let’s get rid of the redundant inputs, and add the (mathematically
redundant) helper product:

samples_aug = torch.tensor([[1., 1., 1.],
                            [0., 0., 0.],
                            [1., 0., 0.],
                            [0., 1., 0.]])
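
(As an aside, these augmented inputs can be built mechanically from the original four-column samples rather than typed by hand; a small sketch, keeping columns 1 and 2 per the redundancy noted above:)

import torch

full = torch.tensor([[0., 1., 1., 0.],
                     [1., 0., 0., 1.],
                     [0., 1., 0., 1.],
                     [1., 0., 1., 0.]])

reduced = full[:, 1:3]                                 # drop the redundant columns 0 and 3
helper = (reduced[:, 0] * reduced[:, 1]).unsqueeze(1)  # the product feature
samples_aug = torch.cat([reduced, helper], dim=1)      # matches the tensor above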

Here is a PyTorch version 0.3.0 script that compares two versus three
hidden neurons in the first layer, with and without the helper product:

import torch
torch.__version__

torch.manual_seed (2020)

samples = torch.autograd.Variable (torch.FloatTensor ([
    [1., 1.],
    [0., 0.],
    [1., 0.],
    [0., 1.]
]))

samples_aug = torch.autograd.Variable (torch.FloatTensor ([
    [1., 1., 1.],
    [0., 0., 0.],
    [1., 0., 0.],
    [0., 1., 0.]
]))

labels = torch.autograd.Variable (torch.FloatTensor ([1., 1., 0., 0.]).view(-1, 1))

nInput = 2
nHidden = 2

model_2_2 = torch.nn.Sequential(
    torch.nn.Linear (nInput, nHidden),
    torch.nn.ReLU(),
    torch.nn.Linear (nHidden, 2),
    torch.nn.ReLU(),
    torch.nn.Linear (2, 1),
    torch.nn.Sigmoid()
)

nInput = 3
nHidden = 2

model_3_2 = torch.nn.Sequential(
    torch.nn.Linear (nInput, nHidden),
    torch.nn.ReLU(),
    torch.nn.Linear (nHidden, 2),
    torch.nn.ReLU(),
    torch.nn.Linear (2, 1),
    torch.nn.Sigmoid()
)

nInput = 2
nHidden = 3

model_2_3 = torch.nn.Sequential(
    torch.nn.Linear (nInput, nHidden),
    torch.nn.ReLU(),
    torch.nn.Linear (nHidden, 2),
    torch.nn.ReLU(),
    torch.nn.Linear (2, 1),
    torch.nn.Sigmoid()
)

nInput = 3
nHidden = 3
model_3_3 = torch.nn.Sequential(
    torch.nn.Linear (nInput, nHidden),
    torch.nn.ReLU(),
    torch.nn.Linear (nHidden, 2),
    torch.nn.ReLU(),
    torch.nn.Linear (2, 1),
    torch.nn.Sigmoid()
)

criterion = torch.nn.BCELoss()


optimizer_2_2 = torch.optim.SGD (model_2_2.parameters(), lr=.1)

for epoch in range (1000):
    shuffle = torch.randperm (len (labels))  # shuffling the training data 
    y_pred = model_2_2 (samples[shuffle])
    loss = criterion (y_pred, labels[shuffle])
    optimizer_2_2.zero_grad()
    loss.backward()
    optimizer_2_2.step()
    if  (epoch + 1) % 100 == 0:
        print ('Epoch %d | Loss: %.4f' % (epoch, loss.data[0]))
        print ('model_2_2 (samples) =', [round (v, 4)  for v in model_2_2 (samples).data.squeeze().tolist()])


optimizer_3_2 = torch.optim.SGD (model_3_2.parameters(), lr=.1)

for epoch in range (1000):
    shuffle = torch.randperm (len (labels))  # shuffling the training data 
    y_pred = model_3_2 (samples_aug[shuffle])
    loss = criterion (y_pred, labels[shuffle])
    optimizer_3_2.zero_grad()
    loss.backward()
    optimizer_3_2.step()
    if  (epoch + 1) % 100 == 0:
        print ('Epoch %d | Loss: %.4f' % (epoch, loss.data[0]))
        print ('model_3_2 (samples_aug) =', [round (v, 4)  for v in model_3_2 (samples_aug).data.squeeze().tolist()])


optimizer_2_3 = torch.optim.SGD (model_2_3.parameters(), lr=.1)

for epoch in range (1000):
    shuffle = torch.randperm (len (labels))  # shuffling the training data 
    y_pred = model_2_3 (samples[shuffle])
    loss = criterion (y_pred, labels[shuffle])
    optimizer_2_3.zero_grad()
    loss.backward()
    optimizer_2_3.step()
    if  (epoch + 1) % 100 == 0:
        print ('Epoch %d | Loss: %.4f' % (epoch, loss.data[0]))
        print ('model_2_3 (samples) =', [round (v, 4)  for v in model_2_3 (samples).data.squeeze().tolist()])


optimizer_3_3 = torch.optim.SGD (model_3_3.parameters(), lr=.1)

for epoch in range (1000):
    shuffle = torch.randperm (len (labels))  # shuffling the training data 
    y_pred = model_3_3 (samples_aug[shuffle])
    loss = criterion (y_pred, labels[shuffle])
    optimizer_3_3.zero_grad()
    loss.backward()
    optimizer_3_3.step()
    if  (epoch + 1) % 100 == 0:
        print ('Epoch %d | Loss: %.4f' % (epoch, loss.data[0]))
        print ('model_3_3 (samples_aug) =', [round (v, 4)  for v in model_3_3 (samples_aug).data.squeeze().tolist()])

And here is the output:

>>> import torch
>>> torch.__version__
'0.3.0b0+591e73e'
>>>
>>> torch.manual_seed (2020)
<torch._C.Generator object at 0x00000210DE836630>
>>>
>>> samples = torch.autograd.Variable (torch.FloatTensor ([
...     [1., 1.],
...     [0., 0.],
...     [1., 0.],
...     [0., 1.]
... ]))
>>>
>>> samples_aug = torch.autograd.Variable (torch.FloatTensor ([
...     [1., 1., 1.],
...     [0., 0., 0.],
...     [1., 0., 0.],
...     [0., 1., 0.]
... ]))
>>>
>>> labels = torch.autograd.Variable (torch.FloatTensor ([1., 1., 0., 0.]).view(-1, 1))
>>>
>>> nInput = 2
>>> nHidden = 2
>>>
>>> model_2_2 = torch.nn.Sequential(
...     torch.nn.Linear (nInput, nHidden),
...     torch.nn.ReLU(),
...     torch.nn.Linear (nHidden, 2),
...     torch.nn.ReLU(),
...     torch.nn.Linear (2, 1),
...     torch.nn.Sigmoid()
... )
>>>
>>> nInput = 3
>>> nHidden = 2
>>>
>>> model_3_2 = torch.nn.Sequential(
...     torch.nn.Linear (nInput, nHidden),
...     torch.nn.ReLU(),
...     torch.nn.Linear (nHidden, 2),
...     torch.nn.ReLU(),
...     torch.nn.Linear (2, 1),
...     torch.nn.Sigmoid()
... )
>>>
>>> nInput = 2
>>> nHidden = 3
>>>
>>> model_2_3 = torch.nn.Sequential(
...     torch.nn.Linear (nInput, nHidden),
...     torch.nn.ReLU(),
...     torch.nn.Linear (nHidden, 2),
...     torch.nn.ReLU(),
...     torch.nn.Linear (2, 1),
...     torch.nn.Sigmoid()
... )
>>>
>>> nInput = 3
>>> nHidden = 3
>>> model_3_3 = torch.nn.Sequential(
...     torch.nn.Linear (nInput, nHidden),
...     torch.nn.ReLU(),
...     torch.nn.Linear (nHidden, 2),
...     torch.nn.ReLU(),
...     torch.nn.Linear (2, 1),
...     torch.nn.Sigmoid()
... )
>>>
>>> criterion = torch.nn.BCELoss()
>>>
>>>
>>> optimizer_2_2 = torch.optim.SGD (model_2_2.parameters(), lr=.1)
>>>
>>> for epoch in range (1000):
...     shuffle = torch.randperm (len (labels))  # shuffling the training data
...     y_pred = model_2_2 (samples[shuffle])
...     loss = criterion (y_pred, labels[shuffle])
...     optimizer_2_2.zero_grad()
...     loss.backward()
...     optimizer_2_2.step()
...     if  (epoch + 1) % 100 == 0:
...         print ('Epoch %d | Loss: %.4f' % (epoch, loss.data[0]))
...         print ('model_2_2 (samples) =', [round (v, 4)  for v in model_2_2 (samples).data.squeeze().tolist()])
...
Epoch 99 | Loss: 0.6932
model_2_2 (samples) = [0.5061, 0.5061, 0.5061, 0.5061]
Epoch 199 | Loss: 0.6931
model_2_2 (samples) = [0.5002, 0.5002, 0.5002, 0.5002]
Epoch 299 | Loss: 0.6931
model_2_2 (samples) = [0.5, 0.5, 0.5, 0.5]
Epoch 399 | Loss: 0.6931
model_2_2 (samples) = [0.5, 0.5, 0.5, 0.5]
Epoch 499 | Loss: 0.6931
model_2_2 (samples) = [0.5, 0.5, 0.5, 0.5]
Epoch 599 | Loss: 0.6931
model_2_2 (samples) = [0.5, 0.5, 0.5, 0.5]
Epoch 699 | Loss: 0.6931
model_2_2 (samples) = [0.5, 0.5, 0.5, 0.5]
Epoch 799 | Loss: 0.6931
model_2_2 (samples) = [0.5, 0.5, 0.5, 0.5]
Epoch 899 | Loss: 0.6931
model_2_2 (samples) = [0.5, 0.5, 0.5, 0.5]
Epoch 999 | Loss: 0.6931
model_2_2 (samples) = [0.5, 0.5, 0.5, 0.5]
>>>
>>> optimizer_3_2 = torch.optim.SGD (model_3_2.parameters(), lr=.1)
>>>
>>> for epoch in range (1000):
...     shuffle = torch.randperm (len (labels))  # shuffling the training data
...     y_pred = model_3_2 (samples_aug[shuffle])
...     loss = criterion (y_pred, labels[shuffle])
...     optimizer_3_2.zero_grad()
...     loss.backward()
...     optimizer_3_2.step()
...     if  (epoch + 1) % 100 == 0:
...         print ('Epoch %d | Loss: %.4f' % (epoch, loss.data[0]))
...         print ('model_3_2 (samples_aug) =', [round (v, 4)  for v in model_3_2 (samples_aug).data.squeeze().tolist()])
...
Epoch 99 | Loss: 0.6926
model_3_2 (samples_aug) = [0.5092, 0.5072, 0.5093, 0.5058]
Epoch 199 | Loss: 0.6889
model_3_2 (samples_aug) = [0.5008, 0.4941, 0.5008, 0.4852]
Epoch 299 | Loss: 0.6755
model_3_2 (samples_aug) = [0.5063, 0.4866, 0.5063, 0.448]
Epoch 399 | Loss: 0.6013
model_3_2 (samples_aug) = [0.542, 0.5209, 0.5416, 0.2982]
Epoch 499 | Loss: 0.4251
model_3_2 (samples_aug) = [0.6544, 0.6344, 0.5311, 0.0498]
Epoch 599 | Loss: 0.1344
model_3_2 (samples_aug) = [0.7972, 0.7972, 0.0621, 0.0127]
Epoch 699 | Loss: 0.0678
model_3_2 (samples_aug) = [0.8842, 0.8842, 0.0163, 0.0068]
Epoch 799 | Loss: 0.0442
model_3_2 (samples_aug) = [0.9216, 0.9216, 0.0081, 0.0046]
Epoch 899 | Loss: 0.0325
model_3_2 (samples_aug) = [0.9414, 0.9414, 0.0052, 0.0035]
Epoch 999 | Loss: 0.0255
model_3_2 (samples_aug) = [0.9534, 0.9534, 0.0036, 0.0027]
>>>
>>> optimizer_2_3 = torch.optim.SGD (model_2_3.parameters(), lr=.1)
>>>
>>> for epoch in range (1000):
...     shuffle = torch.randperm (len (labels))  # shuffling the training data
...     y_pred = model_2_3 (samples[shuffle])
...     loss = criterion (y_pred, labels[shuffle])
...     optimizer_2_3.zero_grad()
...     loss.backward()
...     optimizer_2_3.step()
...     if  (epoch + 1) % 100 == 0:
...         print ('Epoch %d | Loss: %.4f' % (epoch, loss.data[0]))
...         print ('model_2_3 (samples) =', [round (v, 4)  for v in model_2_3 (samples).data.squeeze().tolist()])
...
Epoch 99 | Loss: 0.6537
model_2_3 (samples) = [0.538, 0.5671, 0.4874, 0.53]
Epoch 199 | Loss: 0.5327
model_2_3 (samples) = [0.6096, 0.67, 0.3885, 0.529]
Epoch 299 | Loss: 0.2323
model_2_3 (samples) = [0.8878, 0.8619, 0.2533, 0.3082]
Epoch 399 | Loss: 0.0967
model_2_3 (samples) = [0.9735, 0.9617, 0.145, 0.1497]
Epoch 499 | Loss: 0.0577
model_2_3 (samples) = [0.9885, 0.9832, 0.0957, 0.0955]
Epoch 599 | Loss: 0.0408
model_2_3 (samples) = [0.9925, 0.9894, 0.0694, 0.0694]
Epoch 699 | Loss: 0.0308
model_2_3 (samples) = [0.9951, 0.993, 0.0547, 0.054]
Epoch 799 | Loss: 0.0249
model_2_3 (samples) = [0.996, 0.9945, 0.0439, 0.0439]
Epoch 899 | Loss: 0.0210
model_2_3 (samples) = [0.9964, 0.9953, 0.037, 0.037]
Epoch 999 | Loss: 0.0178
model_2_3 (samples) = [0.9973, 0.9964, 0.0318, 0.0318]
>>>
>>> optimizer_3_3 = torch.optim.SGD (model_3_3.parameters(), lr=.1)
>>>
>>> for epoch in range (1000):
...     shuffle = torch.randperm (len (labels))  # shuffling the training data
...     y_pred = model_3_3 (samples_aug[shuffle])
...     loss = criterion (y_pred, labels[shuffle])
...     optimizer_3_3.zero_grad()
...     loss.backward()
...     optimizer_3_3.step()
...     if  (epoch + 1) % 100 == 0:
...         print ('Epoch %d | Loss: %.4f' % (epoch, loss.data[0]))
...         print ('model_3_3 (samples_aug) =', [round (v, 4)  for v in model_3_3 (samples_aug).data.squeeze().tolist()])
...
Epoch 99 | Loss: 0.6778
model_3_3 (samples_aug) = [0.5113, 0.4968, 0.5173, 0.4568]
Epoch 199 | Loss: 0.5863
model_3_3 (samples_aug) = [0.5727, 0.5431, 0.5767, 0.2665]
Epoch 299 | Loss: 0.4890
model_3_3 (samples_aug) = [0.6795, 0.6138, 0.6359, 0.0656]
Epoch 399 | Loss: 0.1760
model_3_3 (samples_aug) = [0.9167, 0.802, 0.2872, 0.0332]
Epoch 499 | Loss: 0.0286
model_3_3 (samples_aug) = [0.9821, 0.965, 0.0446, 0.0129]
Epoch 599 | Loss: 0.0114
model_3_3 (samples_aug) = [0.9931, 0.9846, 0.0166, 0.0063]
Epoch 699 | Loss: 0.0066
model_3_3 (samples_aug) = [0.9963, 0.9905, 0.0092, 0.0037]
Epoch 799 | Loss: 0.0044
model_3_3 (samples_aug) = [0.9977, 0.9933, 0.0061, 0.0025]
Epoch 899 | Loss: 0.0033
model_3_3 (samples_aug) = [0.9984, 0.9949, 0.0044, 0.0019]
Epoch 999 | Loss: 0.0026
model_3_3 (samples_aug) = [0.9988, 0.9959, 0.0034, 0.0015]

Good luck.

K. Frank


Thank you for your response. I realized I needed to work on the optimization setup. For example, the choice between least-squares loss and cross entropy matters, and the Adam optimizer seems to work quite a bit better than SGD; that helped me a lot.
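
(For anyone who finds this later, the optimizer swap is a one-line change; the learning rate below is only illustrative:)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # instead of torch.optim.SGD(...)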