I am trying to understand PyTorch through a toy example: training a perceptron to classify a couple of data points. I am using a sigmoid activation and binary cross entropy as my loss.
This is my code:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np
# set random seed
torch.manual_seed(0)
# define perceptron
class Perceptron(nn.Module):
    def __init__(self):
        super(Perceptron, self).__init__()
        self.linear = nn.Linear(in_features=2, out_features=1)

    def forward(self, x):
        x = self.linear(x)
        x = torch.sigmoid(x)
        return x
# create test data points
class testData(Dataset):
    def __init__(self):
        super(testData, self).__init__()
        test_data = [([3, 2], 1), ([1, 1], 0)]
        self.data = test_data

    def __getitem__(self, index):
        dp, label = self.data[index]
        dp = torch.FloatTensor(dp)
        label = torch.tensor(label)
        return dp, label

    def __len__(self):
        return len(self.data)
def main():
    # epochs at which we plot the model
    epoch_samples = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    fig, (ax1, ax2) = plt.subplots(nrows=2)
    # plot the data points
    ax1.scatter([3], [2], color='b')
    ax1.scatter([1], [1], color='g')
    # instantiate the model, optimiser and dataloader
    model = Perceptron()
    dataset = testData()
    dataloader = DataLoader(dataset, batch_size=2)
    optimiser = optim.SGD(model.parameters(), lr=1, weight_decay=0)
    # train over epochs 0-100 so that every epoch in epoch_samples gets plotted
    for epoch in range(101):
        total_loss = 0
        for idx, batch in enumerate(dataloader):
            dp, label = batch
            preds = model(dp)
            loss = F.binary_cross_entropy(preds.float(), label.unsqueeze(1).float())
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
            total_loss += loss.item()
        # add model to plot if epoch is in the epoch samples list
        if epoch in epoch_samples:
            weights = model.linear.weight.detach().numpy()
            w0 = weights[0, 0]
            w1 = weights[0, 1]
            bias = model.linear.bias.detach().numpy()[0]
            # decision boundary: w0*x + w1*y + bias = 0  =>  y = -(w0*x + bias)/w1
            x = np.linspace(0, 5, 50)
            y = -((w0 * x + bias) / w1)
            # set the alpha coefficient based on position of epoch in list
            a = (epoch_samples.index(epoch) + 1) / len(epoch_samples)
            # plot model
            ax1.plot(x, y, color='r', alpha=a)
            # add the abs weight value against epoch data point
            ax2.scatter(epoch, [np.abs(w0) + np.abs(w1)], color='b')
    # output the plot
    plt.show()

if __name__ == "__main__":
    main()
Output plot:
The top plot shows the learned decision boundary at the sampled epochs, with fainter lines indicating earlier epochs. The bottom plot shows the epoch number against the sum of the absolute values of the weights.
I first thought to use weight decay because, while the model was successfully classifying the points and the loss was decreasing with every epoch, I would have expected the lowest-error model to be the perpendicular bisector of the segment connecting the two points. I would have thought that as the model continued to decrease the binary cross entropy loss, it would tend towards that perpendicular bisector. Instead, if you zoom into the plot, the model seems to be tending towards a line that only just classifies the points correctly (it appears to rotate clockwise with each epoch)!
So I thought that this behaviour occurs because the model is not being penalised for simply increasing the magnitude of the weights to decrease the loss.
Introducing weight decay with lambda = 0.2 seems to throw the model off completely.
(New users can't post two images, so just set the weight_decay parameter to 0.2 and you should see the problem.)
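For reference, the only change from the code above is the optimiser line; everything else stays the same:

optimiser = optim.SGD(model.parameters(), lr=1, weight_decay=0.2)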
Why?
I guess the question I am asking is: given some model which correctly separates the data, there are two ways to decrease the loss. One is to increase the magnitude of the weights while maintaining their relative ratio (so the slope of the decision boundary is left unchanged); the other is to change the ratio of the weights so that the line tends towards the perpendicular bisector of the segment connecting the two data points. A quick sanity check of the first option is sketched below.
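To make the first option concrete, here is a small sketch (the weights and bias are made-up values that happen to separate the two points, not the ones my model actually learns): scaling them all by a constant leaves the decision boundary exactly where it is, yet the binary cross entropy keeps dropping.

import torch
import torch.nn.functional as F

# the two toy points and their labels
x = torch.tensor([[3., 2.], [1., 1.]])
y = torch.tensor([[1.], [0.]])

# made-up weights/bias that already separate the points
w = torch.tensor([[1., 1.]])
b = torch.tensor([-2.5])  # decision boundary: x0 + x1 - 2.5 = 0

for scale in [1, 2, 4, 8]:
    # scaling w and b together keeps the boundary identical
    preds = torch.sigmoid(x @ (scale * w).t() + scale * b)
    print(scale, F.binary_cross_entropy(preds, y).item())
# the loss keeps shrinking even though the line never moves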
I thought L2 regularisation would prevent the model from blindly increasing the magnitude of the weights to decrease the loss, and instead push it towards the perpendicular bisector. Instead, L2 regularisation fails to give me a model which separates the data points at all… Why?
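On the weight decay side, my understanding is that plain SGD implements it by adding weight_decay * w to the gradient before the update, so with lr=1 and weight_decay=0.2 the decay term alone shrinks every weight by 20% per step, regardless of the data. A minimal sketch isolating just that term (a single dummy parameter with a zero data gradient):

import torch
import torch.optim as optim

# one dummy parameter, watched in isolation
w = torch.nn.Parameter(torch.tensor([1.0]))
opt = optim.SGD([w], lr=1, weight_decay=0.2)

for step in range(3):
    opt.zero_grad()
    (w * 0).sum().backward()  # zero data gradient: only the decay term acts
    opt.step()
    print(w.item())
# prints 0.8, 0.64, 0.512...: each step multiplies w by (1 - lr * weight_decay)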