Loss problem in net finetuning

I’m obviously doing something wrong while trying to finetune this implementation of SegNet. These are my results for accuracy and loss in TensorBoard.

The loss graph has the right overall shape, but both curves show a very strange, clearly wrong behaviour during the first training epoch. Judging by the accuracy, it almost looks like finetuning works correctly for the first epoch and then the net starts from scratch.

This is a bare-bones version of the code I’m working with:

def train(epoch):
    model.train()

    # update learning rate
    exp_lr_scheduler.step()

    total_loss = 0
    total_accuracy = 0

    # iteration over the batches
    for batch_idx, (img, gt) in enumerate(train_loader):

        input = Variable(img)
        target = Variable(gt)

        # initialize gradients
        optimizer.zero_grad()

        # predictions
        output = model(input)

        cr_en_loss = nn.CrossEntropyLoss()
        loss = cr_en_loss(output, target)
        loss.backward()
        optimizer.step()

        """
        Here I calculate accuracy for this batch and log results
        """

        total_loss += loss.data[0]
        total_accuracy += accuracy

    return total_loss / len(train_loader), total_accuracy / len(train_loader)

# create SegNet model
model = SegNet(input_channels, label_numbers)
th = torch.load('path/of/pretrained/weights.pth')
model.load_state_dict(th)

# finetuning - freezing all the net's layers but the last one
ftparams = ['conv11d.weight', 'conv11d.bias']
for name, param in model.named_parameters():
    if name not in ftparams:
        param.requires_grad = False

# define the optimizer
optimizer = optim.SGD(model.conv11d.parameters(), lr=lr, momentum=momentum)
exp_lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma)

transform_train = transforms.Compose([
    """
    Here I apply my transforms
    """
])

train_dataset = MyDataset(root_dir_img, root_dir_gt, transform_train)
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    pin_memory=True
)

for epoch in range(epochs):

    # training
    train_loss, train_acc = train(epoch)

Where is my mistake? Why does my net forget everything starting from the second epoch?


While finetuning your model you have to make sure the learning rate is not too high, since the pre-trained model already has “good” weights.
How high is your learning rate?

learning rate = 0.001
momentum = 0.5

Could one of these parameters be at fault? The change of behaviour after the first epoch looks really strange to me, almost as if finetuning goes okay at the beginning and then starts from scratch in the second epoch.

Try to lower the lr to 1e-4 or even 1e-5.
How did you choose the settings for StepLR?

I’ll try right now and let you know in a minute.
This is the lr scheduler:

exp_lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

Since it’s a pretrained net and I only need to tweak the last layer a little, I was thinking of training it for only 15 epochs.

I tried lr = 1e-4 and 1e-5, but the problem is still there. I think it might have something to do with my implementation of this training procedure? Or could it be the momentum?

Well, you are apparently using an older version of PyTorch; I don’t think that’s the problem here, though.
However, you should upgrade to the latest stable release, since e.g. Variables and tensors were merged.
You can find the install instructions on the website.

Also, you don’t have to reconstruct the criterion in every iteration.
Move cr_en_loss = nn.CrossEntropyLoss() above the for loop.
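For reference, a minimal sketch of what I mean (using the same names as your snippet):

cr_en_loss = nn.CrossEntropyLoss()  # construct the criterion once

for batch_idx, (img, gt) in enumerate(train_loader):
    optimizer.zero_grad()
    output = model(Variable(img))
    loss = cr_en_loss(output, Variable(gt))  # reuse the same criterion object
    loss.backward()
    optimizer.step()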
This shouldn’t be the problem either.

Is the loss increase happening after exactly one full epoch?

I’m stuck on the older version because of my work group; unfortunately I can’t upgrade to the new version right now.
The loss criterion is outside the for loop in my actual code, I only put it there to make the snippet easier to read.

The drastic loss increase and the accuracy decrease both happen during exactly one epoch, every time I run the experiment.
My guess is that the net only uses the pretrained weights during the first epoch (and they are good at their job), which gives me good results. Starting from the second epoch the net fresh-starts without any weights and learns from zero.
That would explain why the accuracy drops so much (it’s a segmentation task, and all output images become almost completely black).

OK, something seems to be broken. Could you post the whole code?
As far as I can tell, the current code looks good.

If you cannot post the code due to your work policy, could you have a look at the norm of the gradients in the first and second epoch?

In a few moments I will post the whole code, no problem. I will comment some parts to make it easier to read.

This is the full code:

import argparse
import logger
import time
import torch
import torch.backends.cudnn as cudnn
import torch.nn as nn
import torch.optim as optim
import transforms
from data import MyDataset
from segnet import SegNet
from torch.autograd import Variable

def train(epoch):
    model.train()

    # update learning rate
    exp_lr_scheduler.step()

    total_loss = 0
    total_accuracy = 0

    # iteration over the batches
    for batch_idx, (img, gt) in enumerate(train_loader):

        if use_cuda:
            img = img.cuda(async=True)
            gt = gt.cuda(async=True)

        input = Variable(img)
        target = Variable(gt)

        # initialize gradients
        optimizer.zero_grad()

        # predictions
        output = model(input)

        """
        output is (24, 2, 224, 224)
        target is (24, 1, 224, 224)
        Here I change target.view() and type in order to use nn.CrossEntropyLoss()
        """
        
        tb = target.size(0)
        tc = target.size(1)
        th = target.size(2)
        tw = target.size(3)
        target_long = target.view(tb, th, tw).long()

        loss = cren_loss(output.cuda(), target_long.cuda())
        loss.backward()
        optimizer.step()

        """
        This is a segmentation task, so in the next part I compute how many 1 pixels are correctly classificated
        as 1 and how many 0 pixels are correctly 0. Then I simply calculate the mean of foreground and background
        accuracy.
        """
        
        output_pred = softmax(output)
        _, prediction = output_pred.max(dim=1)
        prediction = prediction.unsqueeze(1)

        mat_zero2zero = ((prediction == 0) * (target == 0)).int()
        mat_one2one = ((prediction == 1) * (target == 1)).int()

        prediction_back = mat_zero2zero.sum().float()
        target_back = target.numel() - target.sum()

        prediction_fore = mat_one2one.sum().float()
        target_fore = target.sum()

        acc_back = prediction_back / target_back
        acc_fore = prediction_fore / target_fore
        accuracy = (acc_back + acc_fore) / 2

        # TensorBoard logging
        info = {'train-loss': loss.data[0],
                'train-accuracy': accuracy}

        for tag, value in info.items():
            log.scalar_summary(tag, value, batch_idx + 1)

        print('batch: %5s | loss: %.3f | acc_back: %.3f | acc_fore: %.3f | acc: %.3f |'
              % (str(batch_idx + 1) + '/' + str(len(train_loader)),
                 loss.data[0],
                 acc_back,
                 acc_fore,
                 accuracy),
              time.strftime("%H:%M:%S", time.gmtime(time.time())),
              'training')

        total_loss += loss.data[0]
        total_accuracy += accuracy

    return total_loss / len(train_loader), total_accuracy / len(train_loader)


# training settings
parser = argparse.ArgumentParser(description='PyTorch SegNet')
parser.add_argument('--epochs', type=int, default=10, help='train epochs') 
parser.add_argument('--lr', type=float, default=0.0001, help='learning rate')
parser.add_argument('--momentum', type=float, default=0.5, help='SGD momentum')
parser.add_argument('--resume', '-r', action='store_true', help='resume from checkpoint')
args = parser.parse_args()

# cuda
use_cuda = torch.cuda.is_available()

input_nbr = 3
label_nbr = 2
img_size = 224

batch_size = 24
num_workers = 4

start_epoch = 0

softmax = torch.nn.Softmax(dim=1)

if use_cuda:
    cren_loss = nn.CrossEntropyLoss().cuda()
else:
    cren_loss = nn.CrossEntropyLoss()

# create SegNet model
model = SegNet(input_nbr, label_nbr)
model.load_from_filename('/path/to/pretrained/weights')

# convert to cuda if needed
if use_cuda:
    model.cuda()
    cudnn.benchmark = True
else:
    model.float()

# finetuning
ftparams = ['conv11d.weight', 'conv11d.bias']
for name, param in model.named_parameters():
    if name not in ftparams:
        param.requires_grad = False

# define the optimizer
optimizer = optim.SGD(model.conv11d.parameters(), lr=args.lr, momentum=args.momentum)
exp_lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

# define data
root_dir_img = '/path/to/img/dir'
root_dir_gt = './path/to/gt/dir'

transform_train = transforms.Compose([
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.1),
    transforms.RandomResizedCrop(img_size),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor()
])

train_dataset = MyDataset(root_dir_img, root_dir_gt, transform_train)
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
    pin_memory=True
)

# Set the logger
log = logger.Logger('./logs')

for epoch in range(start_epoch, start_epoch + args.epochs):
    print('epoch: %5s' % str(epoch+1))

    # training
    train_loss, train_acc = train(epoch)
    print('\nepoch: %5s | loss: %.3f | acc: %.3f |'
          % (str(epoch + 1) + '/' + str(start_epoch + args.epochs),
             train_loss,
             train_acc),
          time.strftime("%H:%M:%S", time.gmtime(time.time())),
          'training')

    print('\n')

Thanks for the code. I’m currently working on it, creating some dummy data and targets.
One thing I’ve noticed so far is the usage of the transformations.
Since you are working on a segmentation task, I assume you have segmentation maps as the target.
I cannot see how your Dataset is implemented, but if you are using random transformations like RandomResizedCrop and flipping, you have to take care of applying them to your target as well.
Otherwise your input will be transformed while the target is not, and the model might have a hard time learning the relationship between input and target.

The easiest way would be to use the functional API of torchvision.
Here is a small example I created a while ago.
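The idea is roughly this (just a sketch; the crop scale/ratio values and the nearest-neighbor interpolation for the mask are placeholder choices):

import random
from PIL import Image
from torchvision import transforms
import torchvision.transforms.functional as TF

def paired_transform(image, mask, size=224):
    # sample the crop parameters once and apply them to both image and mask
    i, j, h, w = transforms.RandomResizedCrop.get_params(
        image, scale=(0.08, 1.0), ratio=(3. / 4., 4. / 3.))
    image = TF.resized_crop(image, i, j, h, w, (size, size))
    mask = TF.resized_crop(mask, i, j, h, w, (size, size), interpolation=Image.NEAREST)

    # reuse the same random decision for both flips
    if random.random() > 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    if random.random() > 0.5:
        image, mask = TF.vflip(image), TF.vflip(mask)

    return TF.to_tensor(image), TF.to_tensor(mask)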

Let me know, if this helps!

The transformations are already applied to both the images and the ground truths where needed.

The dataset consists of some objects and their binary segmentation maps.
I could provide the code I’m using for dataset creation / transforms / net implementation, if that would help.

Anyway, everything seems to work fine during the first epoch: accuracy is high and loss is low, since the pretrained weights are good. The problem is the transition from the first epoch to the second; my guess is that some parameters are not handled correctly.

How can I check the norm of the gradients you were talking about?

Could you post the transformation part of your Dataset please?
Are you using the transform_train in it?

You can check it with model.conv11d.weight.grad.norm().
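For example, right after loss.backward() in your training loop (a rough sketch):

loss.backward()

# gradient norm of the only trainable layer; compare the values printed
# during the first epoch with those printed during the second epoch
print(model.conv11d.weight.grad.norm())

optimizer.step()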

This is my Dataset class.

import os
import torch.utils.data
from PIL import Image
from PIL import ImageFile


class MyDataset(torch.utils.data.Dataset):

    def __init__(self, root_dir_img, root_dir_gt, transform=None):

        self.root_dir_img = root_dir_img
        self.root_dir_gt = root_dir_gt
        self.transform = transform

        img_names = [os.path.join(root_dir_img, name) for name in os.listdir(root_dir_img) if
                     os.path.isfile(os.path.join(root_dir_img, name))]

        gt_names = [os.path.join(root_dir_gt, name) for name in os.listdir(root_dir_gt) if
                    os.path.isfile(os.path.join(root_dir_gt, name))]

        self.img_files = []
        self.gt_files = []

        # open all images up front and keep the PIL handles in memory
        for i in range(len(img_names)):
            self.img_files.append(Image.open(img_names[i]))
            self.gt_files.append(Image.open(gt_names[i]))

    def __len__(self):
        return len(self.img_files)

    def __getitem__(self, idx):

        ImageFile.LOAD_TRUNCATED_IMAGES = True

        img = self.img_files[idx]
        gt = self.gt_files[idx]

        sample = {'image': img, 'mask': gt}

        if self.transform:
            sample = self.transform(sample)
            img = sample['image']
            gt = sample['mask']

        return img, gt

I will check grad.norm() now.

epoch:  1/10 | loss: 0.499 | acc: 0.877 | 14:18:40 training
Variable containing:
 0.5379
[torch.cuda.FloatTensor of size 1 (GPU 0)]
epoch:  2/10 | loss: 4.012 | acc: 0.506 | 14:18:48 training
Variable containing:
 2.0424
[torch.cuda.FloatTensor of size 1 (GPU 0)]
epoch:  3/10 | loss: 4.082 | acc: 0.504 | 14:18:57 training
Variable containing:
 2.2331
[torch.cuda.FloatTensor of size 1 (GPU 0)]

These are the stats and the gradient norms for the first three epochs.

Could you try to run your code with just one or two image-mask pairs and see how your model behaves then?
I still don’t see any obvious errors in your code, so we might check whether the data is somehow corrupted/changed, even though you are not calling anything after the train() call, right?
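Something like this would do (a sketch; torch.utils.data.Subset only exists in newer releases, so on your version you could instead point root_dir_img / root_dir_gt at a folder containing just two pairs):

# restrict training to two fixed image-mask pairs and check whether the model can overfit them
tiny_dataset = torch.utils.data.Subset(train_dataset, indices=[0, 1])
train_loader = torch.utils.data.DataLoader(
    tiny_dataset,
    batch_size=2,
    shuffle=True,
    num_workers=0,   # single-process loading makes debugging easier
    pin_memory=True
)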

I’m not calling anything after the train function.
If I run the net for inference it works fine; it does a good job at segmenting with the pretrained weights.
But the model obtained after finetuning is unusable (as shown by the accuracy drop from 85% to 50%).
I noticed that if I let the training run for many epochs (100+), I get a working model, basically trained from scratch. This doesn’t solve my problem, but I guess it’s another confirmation that the whole thing “is working”, while the parameters “get lost” between epoch 1 and epoch 2.

Yeah, I see the issue.
Could you remove the truncated images and try it again?
I still have the feeling the error is somehow related to the data.

EDIT: Also, could you remove the cuda() calls from this line:

loss = cren_loss(output.cuda(), target_long.cuda())

You are probably on the right track.

I removed this line:

ImageFile.LOAD_TRUNCATED_IMAGES = True

And I got this error:

Traceback (most recent call last):
 File "/.../train.py", line 192, in <module>
   train_loss, train_acc = train(epoch)
 File "/.../train.py", line 28, in train
   for batch_idx, (img, gt) in enumerate(train_loader):
 File "/.../venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 281, in __next__
   return self._process_next_batch(batch)
 File "/...e/venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 301, in _process_next_batch
   raise batch.exc_type(batch.exc_msg)
OSError: Traceback (most recent call last):
 File "/.../venv/lib/python3.6/site-packages/PIL/ImageFile.py", line 215, in load
   s = read(self.decodermaxblock)
 File "/.../venv/lib/python3.6/site-packages/PIL/PngImagePlugin.py", line 619, in load_read
   cid, pos, length = self.png.read()
 File "/.../venv/lib/python3.6/site-packages/PIL/PngImagePlugin.py", line 114, in read
   length = i32(s)
 File "/.../venv/lib/python3.6/site-packages/PIL/_binary.py", line 76, in i32be
   return unpack(">I", c[o:o+4])[0]
struct.error: unpack requires a buffer of 4 bytes

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
 File "/.../venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 55, in _worker_loop
   samples = collate_fn([dataset[i] for i in batch_indices])
 File "/.../venv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 55, in <listcomp>
   samples = collate_fn([dataset[i] for i in batch_indices])
 File "/.../data.py", line 41, in __getitem__
   sample = self.transform(sample)
 File "/.../transforms.py", line 584, in __call__
   sample = t(sample)
 File "/.../transforms.py", line 1074, in __call__
   img = transform(img)
 File "/.../transforms.py", line 584, in __call__
   sample = t(sample)
 File "/.../transforms.py", line 794, in __call__
   return self.lambd(img)
 File "/.../transforms.py", line 1048, in <lambda>
   transforms.append(Lambda(lambda img: adjust_contrast(img, contrast_factor)))
 File "/.../transforms.py", line 462, in adjust_contrast
   enhancer = ImageEnhance.Contrast(img)
 File "/.../venv/lib/python3.6/site-packages/PIL/ImageEnhance.py", line 66, in __init__
   mean = int(ImageStat.Stat(image.convert("L")).mean[0] + 0.5)
 File "/.../venv/lib/python3.6/site-packages/PIL/Image.py", line 879, in convert
   self.load()
 File "/.../venv/lib/python3.6/site-packages/PIL/ImageFile.py", line 220, in load
   raise IOError("image file is truncated")
OSError: image file is truncated