A really interesting thing I found when training with weight decay on the CIFAR-10 dataset

I am trying to implement a Highway Network to classify the CIFAR-10 dataset. As we know, weight decay is usually something we use to reduce overfitting. But in my case, I don’t know why, when I set weight decay to 0 my model fails to learn well and underfits by a lot. Only when I add some weight decay does the model start to fit the data better. Can someone tell me why that happens? It seems to be the complete opposite of what I would expect.

Below is my code:

import torch.nn as nn
import torch.optim as optim

class HighWayBlock(nn.Module):
  def __init__(self, input_channel, output_channel, kernel_size=3, stride=1, padding=0):
    super(HighWayBlock, self).__init__()
    self.skip = nn.Sequential(nn.Conv2d(input_channel, output_channel, 1, bias=False), nn.BatchNorm2d(output_channel))
    self.T = nn.Sequential(nn.Conv2d(input_channel, output_channel, kernel_size, padding=padding), nn.Sigmoid())
    self.conv1 = nn.Conv2d(input_channel, output_channel, kernel_size, stride, padding)
    self.relu = nn.ReLU()
    self.batchnorm = nn.BatchNorm2d(output_channel)
    self.input_channel = input_channel
    self.output_channel = output_channel

  def forward(self, x):
    val = x
    if self.input_channel != self.output_channel:
      val = self.skip(val)
    T = self.T(x)
    x = self.conv1(x)
    x = self.relu(x)
    x = self.batchnorm(x)
    combine = x * T + val * (1-T)
    return combine
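
The gating line `combine = x * T + val * (1-T)` can be checked with plain numbers. Below is a minimal sketch using scalars instead of tensors; the helper names `sigmoid` and `highway_combine` are made up for illustration and are not part of the model. It only shows how the gate value shifts the output between the transformed path and the carried input.

```python
import math

def sigmoid(z):
    """Logistic function, the same nonlinearity nn.Sigmoid applies elementwise."""
    return 1.0 / (1.0 + math.exp(-z))

def highway_combine(h, x, t):
    """Same mix as in forward(): transformed * T + carried * (1 - T)."""
    return h * t + x * (1.0 - t)

# Gate pre-activation 0 gives an even mix of both paths.
t_even = sigmoid(0.0)
# A negative pre-activation pushes the gate toward the carry path.
t_carry = sigmoid(-3.0)

out_even = highway_combine(h=2.0, x=10.0, t=t_even)
out_carry = highway_combine(h=2.0, x=10.0, t=t_carry)
print(t_even, out_even)
print(t_carry, out_carry)
```

With a strongly negative gate pre-activation the block mostly passes its input through, and with a strongly positive one it mostly outputs the transformed value.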

class HighWayModel(nn.Module):
  def __init__(self, input_channel=3, kernel_size=3):
    super(HighWayModel, self).__init__()
    self.hw1 = HighWayBlock(input_channel, 16, padding=same_padding(kernel_size))
    self.hw2 = HighWayBlock(16, 16, padding=same_padding(kernel_size))
    self.hw3 = HighWayBlock(16, 32, padding=same_padding(kernel_size))
    self.hw4 = HighWayBlock(32, 32, padding=same_padding(kernel_size))
    self.hw5 = HighWayBlock(32, 64, padding=same_padding(kernel_size))
    self.hw6 = HighWayBlock(64, 64, padding=same_padding(kernel_size))
    self.maxpool = nn.MaxPool2d(2)
    self.flatten = nn.Flatten()
    self.linear1 = nn.Linear(1024, 2048)
    self.linear2 = nn.Linear(2048, 256)
    self.linear3 = nn.Linear(256, 10)
    self.relu = nn.ReLU()
    self.dropout = nn.Dropout(0.5)
    self.batchnorm1 = nn.BatchNorm1d(2048)
    self.batchnorm2 = nn.BatchNorm1d(256)

  def forward(self, x):
    x = self.hw1(x)
    x = self.maxpool(x)
    x = self.hw2(x)
    x = self.maxpool(x)
    x = self.hw3(x)
    x = self.hw4(x)
    x = self.maxpool(x)
    x = self.hw5(x)
    x = self.hw6(x)
    x = self.flatten(x)
    x = self.linear1(x)
    x = self.relu(x)
    x = self.batchnorm1(x)
    x = self.dropout(x)
    x = self.linear2(x)
    x = self.relu(x)
    x = self.batchnorm2(x)
    x = self.dropout(x)
    x = self.linear3(x)
    return x
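
`same_padding` is not shown in the code above. The sketch below assumes it returns `(kernel_size - 1) // 2`, which keeps height and width unchanged for stride 1 and an odd kernel; that definition is a guess, and `conv_out` is a hypothetical helper. Under that assumption it walks the spatial size through the blocks and pooling layers to check the 1024 input features of `linear1`.

```python
def same_padding(kernel_size):
    """Assumed definition: padding that preserves H and W for stride 1, odd kernel."""
    return (kernel_size - 1) // 2

def conv_out(size, kernel_size, padding, stride=1):
    """Standard output-size formula for a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 32  # CIFAR-10 images are 32x32
k = 3
p = same_padding(k)

# Order used in HighWayModel.forward: hw1 pool hw2 pool hw3 hw4 pool hw5 hw6
for step in ["hw", "pool", "hw", "pool", "hw", "hw", "pool", "hw", "hw"]:
    size = conv_out(size, k, p) if step == "hw" else size // 2

flat_features = 64 * size * size  # 64 channels after hw6
print(size, flat_features)
```

If the real `same_padding` behaves differently, the flattened size would no longer match `nn.Linear(1024, 2048)`.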

def weight_init(m):

    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, 0.0, 0.02)
    if isinstance(m, nn.BatchNorm2d) or isinstance(m, nn.BatchNorm1d):
        nn.init.normal_(m.weight, 0.0, 0.02)
        nn.init.constant_(m.bias, 0)
    if isinstance(m, nn.Linear):
        pass  # Linear init not shown in this snippet
    if isinstance(m, HighWayBlock):
        for layer in m.modules():
            if isinstance(layer, nn.Conv2d):
                nn.init.constant_(layer.weight, -1)
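
One thing that may be worth checking in the init above is what the `HighWayBlock` branch ends up doing. The sketch below assumes the function is applied with `model.apply(weight_init)`, which is not shown here. `TinyBlock` is a hypothetical stand-in with the same three conv roles as `HighWayBlock`, and `init_gate_bias_only` is an alternative written for comparison, not something from my code. It follows the suggestion in the Highway Networks paper to start the transform gate bias at a negative value.

```python
import torch.nn as nn

class TinyBlock(nn.Module):
    """Hypothetical stand-in with the same three conv roles as HighWayBlock."""
    def __init__(self, c_in=3, c_out=4):
        super().__init__()
        self.skip = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.T = nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.Sigmoid())
        self.conv1 = nn.Conv2d(c_in, c_out, 3, padding=1)

def init_as_posted(m):
    """Mirrors the Conv2d and block branches of weight_init."""
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, 0.0, 0.02)
    if isinstance(m, TinyBlock):
        for layer in m.modules():
            if isinstance(layer, nn.Conv2d):
                nn.init.constant_(layer.weight, -1)

def init_gate_bias_only(m, gate_bias=-1.0):
    """Alternative: small random conv weights, negative bias on the gate conv only."""
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, 0.0, 0.02)
    if isinstance(m, TinyBlock):
        nn.init.constant_(m.T[0].bias, gate_bias)

# Module.apply visits children first and the parent last, so the block
# branch runs after the Conv2d branch and overwrites its result.
posted = TinyBlock().apply(init_as_posted)
all_minus_one = all(
    bool((c.weight == -1).all())
    for c in posted.modules() if isinstance(c, nn.Conv2d)
)

alt = TinyBlock().apply(init_gate_bias_only)
gate_bias_ok = bool((alt.T[0].bias == -1).all())
conv_weights_small = float(alt.conv1.weight.abs().max()) < 1.0
print(all_minus_one, gate_bias_ok, conv_weights_small)
```

Under that assumption, every conv weight inside a block (skip, gate, and `conv1`) starts at exactly -1 rather than only the gate being biased, which might interact with weight decay.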

optimizer = optim.Adam(model.parameters(), lr=l, weight_decay=k)

That’s my basic setup. I have tried learning rates of 0.0005 and 0.0008, and both give the same result. With weight decay 0, the validation accuracy is about 55 percent; with weight decay 0.001, it goes up to 75-76 percent.
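
Note that `weight_decay` in `optim.Adam` is an L2 term added to the gradient of every parameter passed in, BatchNorm scales and biases included. As a possible experiment to narrow things down, here is a minimal sketch that applies decay only to multi-dimensional weights. `build_optimizer` and `stand_in` are hypothetical names for illustration, and the default values are just the ones mentioned above.

```python
import torch.nn as nn
import torch.optim as optim

def build_optimizer(model, lr=5e-4, weight_decay=1e-3):
    """Adam with weight decay on conv/linear weights only (hypothetical helper)."""
    decay, no_decay = [], []
    for _, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 1-D parameters are biases and BatchNorm scale/shift terms.
        (no_decay if p.ndim <= 1 else decay).append(p)
    return optim.Adam(
        [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},
        ],
        lr=lr,
    )

# Small stand-in model; the real HighWayModel would be passed the same way.
stand_in = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16))
optimizer = build_optimizer(stand_in)
print([len(g["params"]) for g in optimizer.param_groups])
```

Comparing this against decay on all parameters would show whether the accuracy gap comes from shrinking the conv weights or from the BatchNorm parameters.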

I am guessing I either implemented the Highway Network incorrectly or something else is going on that I am not aware of. I also implemented a ResNet, and it shows no such behavior.