Bias breaks learning?

I have a really basic conv net training on MNIST: just conv2d -> relu -> conv2d -> relu -> CrossEntropyLoss. When I initialize the Conv2d layers with bias=True, it trains for many epochs through the dataset while remaining at chance accuracy (10%). But with bias=False it reaches 90% accuracy after 5 epochs. Can anyone explain why this might be happening? If it matters, the last conv2d is effectively a linear layer: its kernel size equals its input size.

That sounds weird. Do you mind posting your script?

Don’t mind at all. Not sure how to correctly format code here; hope this is right. I started from the pytorch wide-resnet code, since later I’ll need to build it back up to that architecture: https://github.com/meliketoy/wide-resnet.pytorch. Toggling bias between True and False reproducibly switches it between not learning and learning.

modified networks/wide_resnet.py:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable


def conv_init_pr(model):
    # He-style init, scaled by fan-out (kernel area * out_channels)
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
            m.weight.data.normal_(0, math.sqrt(2. / n))


class wide_basic(nn.Module):
    def __init__(self, in_planes, planes, dropout_rate, stride=1, kernel_size=3):
        super(wide_basic, self).__init__()
        # dropout_rate is unused in this stripped-down block
        self.conv2 = nn.Conv2d(in_planes, planes, kernel_size=kernel_size,
                               stride=stride, padding=1, bias=True)

    def forward(self, x):
        return self.conv2(F.relu(x))


class Wide_ResNet(nn.Module):
    def __init__(self, depth, widen_factor, dropout_rate, num_classes):
        super(Wide_ResNet, self).__init__()
        # depth/widen_factor/num_classes are unused in this stripped-down version
        self.in_planes = 3
        n = 1  # one block per layer
        k = 1  # widen factor (unused here)
        # 3x3 conv, stride 2: a 28x28 MNIST input becomes 14x14
        self.layer1 = self._wide_layer(wide_basic, 16, n, dropout_rate, stride=2, kernel_size=3)
        # 16x16 kernel over the padded 14x14 map gives 1x1, so this acts as a linear layer
        self.layer2 = self._wide_layer(wide_basic, 10, n, dropout_rate, stride=2, kernel_size=16)

    def _wide_layer(self, block, planes, num_blocks, dropout_rate, stride, kernel_size):
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, dropout_rate, stride, kernel_size))
            self.in_planes = planes
        return nn.Sequential(*layers)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = F.relu(out)  # note: this applies ReLU to the logits before CrossEntropyLoss
        return out.squeeze()


if __name__ == '__main__':
    net = Wide_ResNet(28, 10, 0.3, 10)
    # smoke test left over from the CIFAR repo; with a 32x32 input layer2's output is
    # 2x2 rather than the 1x1 you get from a 28x28 MNIST input
    y = net(Variable(torch.randn(1, 3, 32, 32)))
    print(y.size())

...

I’m not spotting anything weird… Could you try different optimizer and lr combinations?

Tried SGD, Adam, and Adadelta, with learning rates from 1 down to 0.001 (stepped every 10 epochs). Same result every time: learns without bias, doesn’t learn with bias.
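
For reference, one of the setups looked roughly like this (the momentum and decay factor here are assumptions for the sketch, not necessarily what I ran):

optimizer = torch.optim.SGD(net.parameters(), lr=1.0, momentum=0.9)
# step the lr down 10x every 10 epochs: 1 -> 0.1 -> 0.01 -> 0.001
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)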

I’ve also now tried making the last layer an actual linear layer, and once again it doesn’t work with a bias and works with bias=False. I also tried manually adding a bias (with bias=False on the layers) using the following, and it still doesn’t work:

self.values = nn.Parameter(torch.Tensor(planes).zero_().cuda(), requires_grad=True)
# note: assigning an nn.Parameter attribute already registers it, so this call is redundant
self.register_parameter('values', self.values)

Edit: Actually it does work with the output linear layer having a bias, but a bias on the hidden convolutional layer will break it.

Have you tried initializing the bias with zeros?
Add the following line to conv_init_pr and try running it again:

m.bias.data.zero_()
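
i.e. the whole function would look something like this (with a guard so layers created with bias=False don’t crash):

def conv_init_pr(model):
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
            m.weight.data.normal_(0, math.sqrt(2. / n))
            if m.bias is not None:  # layers created with bias=False have no bias tensor
                m.bias.data.zero_()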

No luck with that either. I did confirm that the bias starts out at zeros when added.

Alright, thanks to both of you for the replies! And zeroing the bias does help, now that it’s working.

The problem was that my MNIST data was scaled to 0–1 while the pipeline (written for PIL images) expected 0–255, so all the input values were tiny. Multiplying the input data by 255 makes it work. I still don’t understand why this particular problem would come and go with the bias term; if small input values were the issue, it seems like it should also fail without the bias. But at least it’s working, so thanks again!
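
For anyone who hits the same thing, the fix looks roughly like this (assuming torchvision’s MNIST loader; ToTensor is what produces the 0–1 range):

from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                   # PIL image -> float tensor in [0, 1]
    transforms.Lambda(lambda x: x * 255.0),  # rescale back to the [0, 255] range
])
train_set = datasets.MNIST('./data', train=True, download=True, transform=transform)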
