One layer network fails at a binary classification task

Hello,
I have been trying to train a one-layer network on features extracted by another, pre-trained network.
The same features are discriminative when I train linear non-DL classifiers such as a linear SVM or a logistic regression. However, when I train a neural network on these same features, the loss never gets close to zero and eventually stagnates. I have tried everything I could think of: I normalized the features before passing them to the network, increased and decreased the learning rate, decreased the batch size…
This neural network was originally intended to be trained on a much bigger dataset (2 million samples, passed through the same pre-trained feature extractor) than the one I used for the non-DL classifiers (13,000 samples). I am now trying it on the very same dataset used for the non-DL classifiers, and it still does not work. I am really puzzled and would appreciate some help.
Here are the relevant parts of the code:

the one layer neural network

import torch.nn as nn

class Net(nn.Module):
    def __init__(self, embedding_size=512):
        super().__init__()
        self.embedding_size = embedding_size
        # Single linear layer mapping an embedding to one logit
        self.fc = nn.Linear(self.embedding_size, 1)

    def forward(self, x):
        return self.fc(x)

optimizer and loss

advBackbone = Net(embedding_size=cfg.embedding_size).to(device)
opt_adv_backbone = torch.optim.Adam(
    params=[{'params': advBackbone.parameters()}],
    lr=0.01,
)
advcriterion = nn.BCEWithLogitsLoss()

training loop

for epoch in range(start_epoch, cfg.num_epoch):
    for i, (img, _, g) in enumerate(train_loader):
        for p in backbone.parameters():  # fixing the pre-trained feature extractor
            p.requires_grad = False

        global_step += 1
        opt_adv_backbone.zero_grad()
        img = img.to(device)
        g = g.to(device)
        features = F.normalize(backbone(img))
        output = advBackbone(features.detach())
        loss_a = advcriterion(output, g.unsqueeze(1).float())
        loss_a.backward()
        opt_adv_backbone.step()
        loss.update(loss_a, 1)

By the way, when I let it overfit on one batch, the loss does get close to zero (~0.005). When I train on the whole dataset, the loss decreases and then, from a certain epoch on, stagnates around 0.2 for the small dataset and 0.3 for the large one. What I don’t understand is why it fails when a logistic regression or a linear SVM does converge. The fact that they do means that the features, at least those extracted for the small dataset, are indeed linearly separable, so the neural network defined above should be able to converge as well…

Thanks in advance to anyone willing to help.

Hi Tanitz!

I agree with your assessment that you should be able to successfully
train a single Linear on this task. The point is that training the single
Linear with BCEWithLogitsLoss is equivalent to training a logistic
regression (and closely related to training a linear support-vector
machine (SVM), which uses a hinge loss instead), albeit with a
non-standard (and presumably somewhat less efficient) optimization
algorithm.

Some comments:

If your training samples are indeed linearly separable then the prediction
accuracy of your trained linear SVM should be 100%. Do you achieve
perfect accuracy with your SVM?

I do think you should be able to train your Linear network down to a loss
close to zero and a loss of 0.2 or 0.3 does seem too high. But these loss
values can be a little hard to interpret.

What prediction accuracy do you get from your trained Linear? By this I
mean threshold the predictions – as directly output by the Linear – against
zero. Negative means your Linear is predicting class-0 (or the “negative”
class) and positive means class-1 (or the “positive” class). Then calculate
your accuracy as the percentage of these predictions that are correct.
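
A minimal sketch of this accuracy check, assuming the Linear returns raw
logits of shape [batch, 1] and the labels are 0/1 (the tensors below are
made up for illustration, and accuracy_from_logits is a hypothetical
helper, not a PyTorch function):

```python
import torch

def accuracy_from_logits(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Fraction of correct predictions when raw logits are thresholded at zero."""
    # logits > 0 is the same decision as sigmoid(logits) > 0.5
    preds = (logits.view(-1) > 0).long()
    return (preds == targets.view(-1).long()).float().mean().item()

# Made-up logits of shape [4, 1] and labels of shape [4]
logits = torch.tensor([[2.3], [-0.7], [0.4], [-1.9]])
targets = torch.tensor([1, 0, 0, 0])
acc = accuracy_from_logits(logits, targets)  # 3 of the 4 predictions match
```

In a real run the logits and labels would be accumulated over the whole
dataset before computing the fraction.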

If things were working, you could well have 100% accuracy without having
the loss be zero (but I would nonetheless expect the loss to be much smaller
than, say, 0.2).
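
To illustrate how perfect accuracy and a non-zero loss can coexist, here
is a small worked example in plain Python. The logits are invented, and
bce_with_logits is a hand-written per-sample version of the loss, meant
only as a sketch of the formula:

```python
import math

def bce_with_logits(z: float, y: int) -> float:
    """Per-sample binary cross-entropy for a raw logit z and a label y in {0, 1}."""
    # Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# Invented (logit, label) pairs: every sample is on the correct side of zero
samples = [(2.0, 1), (-2.0, 0), (4.0, 1), (-4.0, 0)]

accuracy = sum((z > 0) == bool(y) for z, y in samples) / len(samples)
mean_loss = sum(bce_with_logits(z, y) for z, y in samples) / len(samples)
# accuracy is 1.0 while mean_loss is roughly 0.07, not zero
```

The loss only approaches zero as the logits grow in magnitude, which is
why accuracy is the easier number to interpret.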

The Adam optimizer has the reputation of typically training significantly
faster, but sometimes being more susceptible to getting stuck.

Just to make things simpler and easier to understand, you might try
optimizing with SGD. I would suggest not using weight decay, or at
least using only a very mild weight decay. Using momentum would
probably make sense, but I would suggest at least carrying out a
baseline training run without momentum.
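
A baseline along these lines might look like the following sketch. The
data is a synthetic stand-in for the real features, and the learning rate
of 0.1 and the 200 full-batch steps are arbitrary choices for
illustration, not tuned recommendations:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for the real features: 200 points in 8 dimensions,
# labelled by which side of a random hyperplane they fall on
x = torch.randn(200, 8)
y = (x @ torch.randn(8) > 0).float().unsqueeze(1)

model = nn.Linear(8, 1)
criterion = nn.BCEWithLogitsLoss()
# Plain SGD: no momentum and no weight decay for the baseline run
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.0, weight_decay=0.0)

initial_loss = criterion(model(x), y).item()
for _ in range(200):  # full-batch steps, just to keep the sketch short
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
final_loss = loss.item()
```

Once this baseline behaves as expected, momentum can be switched on by
changing the momentum argument.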

If things still aren’t working, I would recommend distilling your problem
down to a pure numerical linearly-separable classification problem.

Preprocess your data by running it through backbone() (whatever that
is) and store the resulting features as a “new” dataset together with the
labels (which, I imagine, do not need to be preprocessed). Then train
an SVM and a Linear on this distilled dataset.

The point would be to get rid of any extraneous code – train_loader,
backbone, etc. – that might have some bugs hiding in it. Maybe that
fixes something that’s been easy to overlook.
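
The preprocessing step described above could be sketched as follows.
extract_features is a hypothetical helper, and toy_backbone and
toy_loader are stand-ins so that the example runs on its own; the real
backbone and train_loader would take their place. The F.normalize call
mirrors the one in the posted training loop:

```python
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()
def extract_features(backbone: nn.Module, loader: DataLoader, device: str = "cpu"):
    """Run every batch through the frozen backbone once; return (features, labels)."""
    backbone.eval()  # use inference behaviour for dropout / batch norm
    feats, labels = [], []
    # "*_" skips any extra items between image and label, e.g. (img, _, g)
    for img, *_, g in loader:
        feats.append(F.normalize(backbone(img.to(device))).cpu())
        labels.append(g.float().cpu())
    return torch.cat(feats), torch.cat(labels)

# Stand-ins (assumptions): a tiny "backbone" and ten random 3x4x4 "images"
torch.manual_seed(0)
toy_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 4 * 4, 16))
toy_loader = DataLoader(
    TensorDataset(torch.randn(10, 3, 4, 4), torch.randint(0, 2, (10,))),
    batch_size=4,
)

features, labels = extract_features(toy_backbone, toy_loader)
```

The resulting tensors can be saved with torch.save and then fed to both
the SVM and the Linear, so that both see exactly the same numbers.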

Lastly, if your SVM and Linear still don’t agree, extract the discriminative
hyperplane from your (working) SVM. This will be the vector in “feature
space” (of length equal to the number of features which I presume is
embedding_size) that is orthogonal to the discriminative hyperplane
plus the single scalar that is the offset of the hyperplane from the origin.
These map to the Linear's weight matrix (of shape [1, embedding_size])
and the Linear's bias vector (of shape [1], so a scalar). (There may be
a minus sign in either the weight or the bias mapping, depending on
how things are defined inside of your SVM.)

Then use the extracted SVM parameters to initialize your Linear
network and verify that, when so initialized, your Linear network
makes the same (thresholded) predictions as your SVM. Assuming
this works, try training your Linear, just to make sure that your training
procedure isn’t broken and somehow messing up your Linear.
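
As a rough sketch of this initialization, assuming the SVM exposes its
weight vector and offset as arrays: the values of svm_weight and svm_bias
below are invented, and the sign convention is assumed to match the
Linear's.

```python
import torch
from torch import nn

embedding_size = 4  # small stand-in for the real embedding size

# Hypothetical SVM parameters. With scikit-learn's LinearSVC, for example,
# they would come from svm.coef_ (shape [1, embedding_size]) and
# svm.intercept_ (shape [1]), converted to float tensors.
svm_weight = torch.tensor([[0.5, -1.0, 2.0, 0.0]])
svm_bias = torch.tensor([-0.25])

linear = nn.Linear(embedding_size, 1)
with torch.no_grad():
    linear.weight.copy_(svm_weight)
    linear.bias.copy_(svm_bias)

# The thresholded Linear output should match the sign of the SVM decision function
x = torch.tensor([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]])
svm_preds = (x @ svm_weight.t() + svm_bias > 0).long()
with torch.no_grad():
    linear_preds = (linear(x) > 0).long()
agree = torch.equal(linear_preds, svm_preds)
```

If agree turns out to be False on the real data, flipping the sign of the
weight or the bias is the first thing to check.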

Good luck.

K. Frank

Hi KFrank!

Thank you for your suggestions. I said earlier that the features are linearly separable, but actually they aren’t “perfectly” linearly separable. The linear SVM does find a hyperplane that classifies most of the points correctly, but not all of them (over 90% accuracy, with an area under the ROC curve over 90% as well…). I do expect the linear network to converge to a similar solution… I will calculate the accuracy at the point where the loss stagnates and let you know…