Problem with BatchNorm? Accuracy stays constant throughout training

Hi,

I am trying to train a MobileNet V1 model to perform action recognition on UCF-101, but the weights of every BatchNorm layer in the model end up at just about the same value. On top of this, each value within a given BatchNorm layer is approximately the same. The following is an example:

10.4.weight Parameter containing:
tensor([0.0773, 0.0773, 0.0773, 0.0773, 0.0773, 0.0773, 0.0773, 0.0773, 0.0773,
        0.0773, 0.0773, 0.0773, 0.0773, 0.0773, 0.0773, 0.0773, 0.0773, 0.0773,
        ...
        0.0773, 0.0773, 0.0773, 0.0773, 0.0773, 0.0773, 0.0773, 0.0773],
       device='cuda:0')

(truncated for readability: all 512 entries are 0.0773 except four at 0.0772)
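
A quick check that shows the same collapse in every layer (it just prints the spread of each BatchNorm weight tensor):

import torch.nn as nn

def summarise_bn_weights(model):
    # print min/max/std of every BatchNorm weight; a healthy network should
    # show a clearly non-zero spread per layer
    for name, module in model.named_modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            w = module.weight.detach()
            print(f"{name}: min={w.min().item():.4f} "
                  f"max={w.max().item():.4f} std={w.std().item():.6f}")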

As far as I can tell, this is the cause of an issue I am facing where the model output is locked to a single class, despite both train and val loss falling during training. The following image shows some training statistics:

If I set affine=False then accuracy improves during training, but as expected this takes time, and it seems to flatten out at around 8%.
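
(For reference, the affine=False run just built the BatchNorm layers without the learnable scale and shift, e.g. in the conv_bn helper from the code below:)

def conv_bn(inp, oup, stride):
    return nn.Sequential(
        Conv(inp, oup, 3, stride, padding=1, bias=False),
        BatchNorm(oup, affine=False),  # running stats only, no learnable weight/bias
        nn.ReLU(inplace=True))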

Is there anything that could cause BatchNorm to act like this? In case it is helpful, I have included a simplified version of my code below (the final model will use both video and audio as a mel spectrogram, whereas this example only uses audio).

class MobileNetV1(nn.Module):
    def __init__(self, dims, input_channels, embedding_size=1000):
        super(MobileNetV1, self).__init__()
        assert 1 <= dims <= 3
        self.dims = dims

        # pick the conv/batchnorm/pool flavour matching the input dimensionality
        opt_conv = [None, nn.Conv1d, nn.Conv2d, nn.Conv3d]
        opt_bn = [None, nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d]
        opt_pool = [None, nn.AvgPool1d, nn.AvgPool2d, nn.AvgPool3d]

        Conv = opt_conv[dims]
        BatchNorm = opt_bn[dims]
        self.AvgPool = opt_pool[dims]

        def conv_bn(inp, oup, stride):
            # standard 3x3 convolution -> BatchNorm -> ReLU
            return nn.Sequential(
                Conv(inp, oup, 3, stride, padding=1, bias=False),
                BatchNorm(oup),
                nn.ReLU(inplace=True))

        def conv_dw(inp, oup, stride):
            # depthwise separable block: depthwise 3x3 conv, then pointwise 1x1 conv
            return nn.Sequential(
                Conv(inp, inp, kernel_size=3, stride=stride, padding=1, groups=inp, bias=False),
                BatchNorm(inp),
                nn.ReLU(inplace=True),

                Conv(inp, oup, kernel_size=1, stride=1, padding=0, bias=False),
                BatchNorm(oup),
                nn.ReLU(inplace=True))

        self.features = nn.Sequential(
            conv_bn(input_channels, 32, 2),
            conv_dw(32, 64, 1),
            conv_dw(64, 128, 2),
            conv_dw(128, 128, 1),
            conv_dw(128, 256, 2),
            conv_dw(256, 256, 1),
            conv_dw(256, 512, 2),
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 512, 1),
            conv_dw(512, 1024, 2),
            conv_dw(1024, 1024, 1),
        )
        self.fc = nn.Linear(1024, embedding_size)

    def forward(self, x):
        x = self.features(x)
        # global average pool over the remaining spatial dims; x.size() is used
        # rather than the deprecated x.data.size()
        avg_pool = self.AvgPool(x.size()[-self.dims:])
        x = avg_pool(x).view(-1, 1024)
        x = self.fc(x)
        return x

class AudioClassifier(MobileNetV1):
    def __init__(self, embedding_size=1000, train=True):
        super().__init__(embedding_size=embedding_size, dims=2, input_channels=1)

        self.classifier = nn.Linear(embedding_size, 51)

    def forward(self, x):
        x = super().forward(x)
        x = self.classifier(torch.relu(x))
        return x
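
As a quick shape sanity check, a dummy spectrogram batch goes through cleanly (the 1 x 64 x 100 input size is just a placeholder, not my real data shape):

dummy = torch.randn(4, 1, 64, 100)  # (batch, channels, freq_bins, time_steps), placeholder sizes
net = AudioClassifier()
print(net(dummy).shape)  # torch.Size([4, 51])
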
audioNet = mobilenets.AudioClassifier().cuda()  # parameters shown above sit on cuda:0

trainLoader = torch.utils.data.DataLoader(
        dataset.AudioVideoDataset(root=dataLocation + "/train", **train_transforms),
        batch_size=32, shuffle=True, **kwargs)

valLoader = torch.utils.data.DataLoader(
        dataset.AudioVideoDataset(root=dataLocation + "/val", **val_transforms),
        batch_size=64, shuffle=False, **kwargs)

optimizer = optim.SGD(audioNet.parameters(),
                      lr=0.01,
                      momentum=0.9,
                      dampening=0.9,  # only (1 - dampening) = 10% of each gradient enters the momentum buffer
                      weight_decay=1e-3,
                      nesterov=False)

criterion = nn.CrossEntropyLoss()

audioTrain(audioNet)  # training driver, called once the loaders and optimizer exist

def train(args, model, trainLoader, optimizer, criterion):
    model.train()  # switch to train mode
    losses = []

    for i, (audio, video, media_type, target, clip_id) in enumerate(trainLoader, 0):
        # move the batch to the same device as the model
        audio, target = audio.cuda(), target.cuda()

        # zero the parameter gradients
        optimizer.zero_grad()

        audioOut = model(audio)

        loss = criterion(audioOut, target)

        loss.backward()
        losses.append(loss.item())

        optimizer.step()
    averageLoss = sum(losses) / len(losses)
    return averageLoss


def test(args, model, valLoader, criterion):
    model.eval()  # switch to evaluate mode
    losses = []
    correct = 0

    with torch.no_grad():
        for i, (audio, video, media_type, target, clip_id) in enumerate(valLoader):
            audio, target = audio.cuda(), target.cuda()

            audioOut = model(audio)

            loss = criterion(audioOut, target)

            losses.append(loss.item())

            pred = audioOut.max(1, keepdim=True)[1]
            correct += pred.eq(target.view_as(pred)).sum().item()

    accuracy = 100. * correct / len(valLoader.dataset)
    averageLoss = sum(losses) / len(losses)
    return averageLoss, accuracy
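
For completeness, the audioTrain call in the snippet above is just a thin driver around these two functions; a minimal sketch (the real definition is omitted here):

def audioTrain(model, epochs=30):
    # assumed driver: alternate a training and a validation pass each epoch
    for epoch in range(epochs):
        trainLoss = train(args, model, trainLoader, optimizer, criterion)
        valLoss, valAccuracy = test(args, model, valLoader, criterion)
        print(f"epoch {epoch}: train loss {trainLoss:.4f}, "
              f"val loss {valLoss:.4f}, val accuracy {valAccuracy:.2f}%")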

How large is your batch size?
The validation loss seems to be much larger than the training loss, which seems weird.
Did you try to remove the batchnorm layers completely and retrain the model?

Batch size is 32 (I have also tried 16) for training and 64 for validation.

The graph in the original post was from the more complex model, which might explain the higher loss. The validation loss in the simpler example shown above is not so different (example shown below).

I have now tried training without the batchnorm layers, which resulted in the same issue. The attached image is from that run.

What is interesting when watching the model train is that, even from the start, the output is a single class per batch (the same class repeated for every sample). The predicted class just shifts slightly during training, which would probably explain the one jump in accuracy.

Thanks for the follow-up. The new experiment seems to indicate that the issue is not in the batchnorm layers but somewhere in the general training.

I cannot see any obvious errors in the code, so I would recommend trying to overfit the model on a small data subset first (e.g. just 10 samples) and making sure the model is able to do so.
If that’s not the case, there might be a bug I haven’t seen in the code snippets or in another part of the code, which is not posted.
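
Something like this, reusing your train function and dataset, should be enough to check it (the first 10 indices are arbitrary):

from torch.utils.data import Subset, DataLoader

# Overfit on ~10 samples: the loss should approach zero within a few
# hundred steps if the model and input pipeline are working.
smallSet = Subset(dataset.AudioVideoDataset(root=dataLocation + "/train", **train_transforms),
                  indices=list(range(10)))
smallLoader = DataLoader(smallSet, batch_size=10, shuffle=True)

for step in range(300):
    trainLoss = train(args, audioNet, smallLoader, optimizer, criterion)
    if step % 50 == 0:
        print(f"step {step}: loss {trainLoss:.4f}")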

I think I have now found the issue. By the looks of it, the audio input seems to be the cause of the problem. I have not worked out how to fix it yet. Once I have a solution, I will update this thread in case it is useful to others.
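
In case it helps anyone else, a degenerate input shows up quickly by checking the batch statistics straight from the loader:

audio, video, media_type, target, clip_id = next(iter(trainLoader))
print(audio.shape, audio.dtype)
print("mean:", audio.mean().item(), "std:", audio.std().item())
print("NaNs:", torch.isnan(audio).any().item(), "Infs:", torch.isinf(audio).any().item())
# an all-zero, constant, or NaN-filled batch here would explain a fixed single-class output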

To get to this outcome I trained the model on the CIFAR-10 dataset using the same implementation. This worked, but on moving back to my dataset the issue returned. I also tried with the video modality (as a series of frames), which also worked.
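
The CIFAR-10 test looked roughly like this (standard torchvision loading; the 0.5 normalisation constants are just the usual placeholders):

import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
cifarSet = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                        transform=transform)
cifarLoader = torch.utils.data.DataLoader(cifarSet, batch_size=32, shuffle=True)

# 3 input channels instead of the single spectrogram channel; the embedding
# layer doubles as the 10-class output here
cifarNet = MobileNetV1(dims=2, input_channels=3, embedding_size=10).cuda()
cifarOpt = optim.SGD(cifarNet.parameters(), lr=0.01, momentum=0.9)

for images, labels in cifarLoader:
    images, labels = images.cuda(), labels.cuda()
    cifarOpt.zero_grad()
    loss = criterion(cifarNet(images), labels)
    loss.backward()
    cifarOpt.step()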

Thanks ptrblck for your help. I would have tried your latest suggestion, but I was already part way through testing with CIFAR-10, which ended up leading me to where I am now.