Problems with constant training loss

skyunyoo · May 5, 2020, 11:39am

Hello.

I’m having a problem with constant training loss.
Specifically, I am segmenting MRI images with a U-Net.
The data consists of about 100,000 grayscale slices of size 32x32.
The data is split across 10 numpy files. In each epoch the files are loaded in random order, so one epoch consists of 10 training passes (one per file) followed by 1 validation pass.

The essence of the problem is that after approximately 3 epochs, I always get the same value of train loss.

Things I have tried:
“data pre-processing”

image = image*255/image.max()
image = image/(image.max()+0.00001)
image = image*255/image.max(), then image = image/(image.max()+0.00001)

“Remove BatchNorm in Network”
In U-Net’s double conv part,

Used nn.BatchNorm2d after each Conv2d
Did not use nn.BatchNorm2d after each Conv2d

“Learning Rate & Optimizer”

Used SGD or Adam
Used learning rates in the range of 0.00001 to 0.5

“etc…”

I wrote down the code of my custom dataset, u-net network, train / valid loop, etc. below.

Custom dataset

class trainDataset(torch.utils.data.Dataset):
    def __init__(self, i, data_path, augmentation=True):
        self.data_path = data_path
        self.data = np.load(data_path+'Patch_images_{}.npy'.format(i)).astype(np.uint16)
        self.target = np.load(data_path+'Patch_Tumor_{}.npy'.format(i)).astype(np.uint8)

        self.augmentation = augmentation
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        x, y = self.transform(x, y)
            
        return x, y
    
    def transform(self, data, target):
        data, target = train_data(data, target, self.augmentation)
        return data, target
    
    def __len__(self):
        return len(self.data)

def train_data(image, mask, aug=True):
    image = Image.fromarray(image)
    mask = Image.fromarray(mask)

    image = TF.to_tensor(image).float()
    image = image/(image.max()+0.00001)
    mask = binarize(TF.to_tensor(mask)).float()
    return image, mask

U-Net Network & Hyper Parameters

def double_conv(in_channels, out_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 3, padding=1),
        #nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, 3, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True)
    )   

class UNet(nn.Module):

    def __init__(self, n_class):
        super().__init__()
                
        self.dconv_down1 = double_conv(1, 32)
        self.dconv_down2 = double_conv(32, 64)
        self.dconv_down3 = double_conv(64, 128)
        self.dconv_down4 = double_conv(128, 256)        

        self.maxpool = nn.MaxPool2d(2)
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)        
        
        self.dconv_up3 = double_conv(128 + 256, 128)
        self.dconv_up2 = double_conv(64 + 128, 64)
        self.dconv_up1 = double_conv(32 + 64, 32)
        
        self.conv_last = nn.Conv2d(32, n_class, 1)
        
    def forward(self, x):
        conv1 = self.dconv_down1(x)
        x = self.maxpool(conv1)

        conv2 = self.dconv_down2(x)
        x = self.maxpool(conv2)
        
        conv3 = self.dconv_down3(x)
        x = self.maxpool(conv3)   
        
        x = self.dconv_down4(x)
        
        x = self.upsample(x)        
        x = torch.cat([x, conv3], dim=1)
        
        x = self.dconv_up3(x)
        x = self.upsample(x)        
        x = torch.cat([x, conv2], dim=1)       

        x = self.dconv_up2(x)
        x = self.upsample(x)        
        x = torch.cat([x, conv1], dim=1)   
        
        x = self.dconv_up1(x)
        out = self.conv_last(x)
        return out

# Pick the device once so the same code also runs on a CPU-only machine
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = UNet(n_class=2).to(device)

class_weights = torch.tensor([1.0, 1.0], device=device)
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = optim.SGD(model.parameters(), lr=0.00001)
exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

Train/Valid loop

init_state = copy.deepcopy(model.state_dict())
init_state_opt = copy.deepcopy(optimizer.state_dict())
init_state_lr = copy.deepcopy(exp_lr_scheduler.state_dict())

since = time.time()
    
train_losses = []
val_losses = []

early_stopping = EarlyStopping(patience=5, verbose=1)
for epoch in range(num_epochs):
    print()
    print('Epoch {}/{}'.format(epoch, num_epochs - 1))
    print('-' * 10)
    epoch_loss = train_fit(epoch,model,phase='train')
    val_epoch_loss = valid_fit(epoch,model,validloader,phase='valid')
    train_losses.append(epoch_loss)
    val_losses.append(val_epoch_loss)
        
    if early_stopping.validate(val_epoch_loss):
        break

def train_fit(epoch,model,phase='train',volatile=False):
    torch.set_num_threads(4)
    epoch_loss = 0.0
    
    model.train().to(device)
    
    patient_index = list(range(1,11))
    for i in range(10):
        secure_random = random.SystemRandom()
        random_patient = secure_random.choice(patient_index)
        train_datasets = trainDataset(random_patient,"data_path/",augmentation=True)
        patient_index.remove(random_patient)
        data_loader = torch.utils.data.DataLoader(train_datasets, batch_size = batch_size, shuffle=True, num_workers=0, pin_memory=False)
    
        running_loss = 0.0
        for batch_idx , (data,target) in enumerate(data_loader):
            inputs,target = data.to(device),target.to(device)
            optimizer.zero_grad()
            with torch.set_grad_enabled(phase == 'train'):         
                output = model(inputs).to(device)
                loss = criterion(output,target.long()).to(device)
                if phase == 'train':
                    loss.backward()
                    optimizer.step()
            running_loss += loss.item()*inputs.size(0)
        if phase == 'train':
            exp_lr_scheduler.step()   
        loss = running_loss/len(data_loader.dataset)
        epoch_loss += loss
    epoch_loss = epoch_loss/10
    print('{} Loss: {:.4f}'.format(phase, epoch_loss))
    return epoch_loss

def valid_fit(epoch,model,data_loader,phase='train',volatile=False):
    torch.set_num_threads(4)
    if phase == 'train':
        model.train().to(device)
    if phase == 'valid':
        model.eval().to(device)

    running_loss = 0.0
    for batch_idx , (data,target) in enumerate(data_loader):
        inputs,target = data.to(device),target.to(device)
        optimizer.zero_grad()
        with torch.set_grad_enabled(phase == 'train'):         
            output = model(inputs).to(device)
            loss = criterion(output,target.long()).to(device)
            if phase == 'train':
                loss.backward()
                optimizer.step()
        running_loss += loss.item()*inputs.size(0)
    if phase == 'train':
        exp_lr_scheduler.step()   
    loss = running_loss/len(data_loader.dataset)
    print('{} Loss: {:.4f}'.format(
                phase, loss))
    return loss

The content is rather long, but if there are any parts I am missing or I am making mistakes, I would appreciate any help.
Thanks!

ptrblck · May 6, 2020, 5:37am

The code looks generally alright.
Could you describe what kind of transformation you are using for the dataset?
Since the data and target are both transformed, I assume that you are making sure that all random transformations are applied in the same way on both tensors?
Also, did you make sure that the target looks valid?
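
Both checks can be sketched in a few lines. This is a minimal illustration on toy tensors, not the poster's pipeline: paired_random_flip and check_target are hypothetical helper names, and a horizontal flip stands in for any random transformation. The point is that the random decision is drawn once and then applied to image and mask alike, and that the target contains only valid class indices.

```python
import random
import torch

def paired_random_flip(image, mask, p=0.5):
    # Draw the random decision once so image and mask get the same transform.
    if random.random() < p:
        image = torch.flip(image, dims=[-1])
        mask = torch.flip(mask, dims=[-1])
    return image, mask

def check_target(mask, n_class=2):
    # nn.CrossEntropyLoss expects integer class indices in [0, n_class - 1].
    values = torch.unique(mask)
    valid = bool(((values >= 0) & (values < n_class)).all())
    return valid, values.tolist()

# Toy data: the mask is derived from the image, so misalignment is detectable.
image = torch.arange(16, dtype=torch.float32).reshape(1, 4, 4)
mask = (image > 7).long()

flipped_image, flipped_mask = paired_random_flip(image, mask, p=1.0)
ok, values = check_target(flipped_mask)
print('mask still aligned:', torch.equal(flipped_mask, (flipped_image > 7).long()))
print('target valid:', ok, 'values:', values)
```

If the mask were transformed with a separately drawn random number, the alignment check would fail on a fraction of the samples.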

I would also recommend to try to overfit a small data sample (e.g. 10 samples) to make sure there are no bugs in the code we are missing.
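
A minimal sketch of such an overfitting test is shown below. Everything in it is an assumption made for illustration: the 10 samples are random tensors with a thresholded mask, and a deliberately tiny convolutional model stands in for the posted UNet(n_class=2). With the real model and 10 real slices the structure of the loop would be the same.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical stand-in data: 10 random 1x32x32 images and a simple mask.
images = torch.rand(10, 1, 32, 32)
masks = (images[:, 0] > 0.5).long()  # shape [10, 32, 32], class indices 0/1

# Tiny placeholder model; substitute the real UNet(n_class=2) here.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 2, 1),
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-2)

losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(images), masks)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print('first loss {:.4f}, last loss {:.4f}'.format(losses[0], losses[-1]))
```

If the loss does not drop clearly on such a small fixed sample, the problem is more likely in the data pipeline or training loop than in the model capacity.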

skyunyoo · May 7, 2020, 7:11am

Thanks for your help.

First, the transformation I used is as follows.
I confirmed that augmentation is applied to the same image and mask.

def data_augmentation(image, mask, aug=True):
    image = Image.fromarray(image)
    mask = Image.fromarray(mask)

    if aug:
        if random.random() > 0.5:
            alpha = random.randint(100, 200)
            augmented = HorizontalFlip(p=1)(image=np.array(image), mask=np.array(mask))
            image = Image.fromarray(augmented['image'])
            mask = Image.fromarray(augmented['mask'])
            
        if random.random() > 0.5:
            alpha = random.randint(100, 200)
            augmented = Rotate(p=1, limit=45)(image=np.array(image), mask=np.array(mask))
            image = Image.fromarray(augmented['image'])
            mask = Image.fromarray(augmented['mask'])
            
        if random.random() > 0.5:
            alpha = random.randint(100, 200)
            augmented = Blur(p=1, blur_limit=5)(image=np.array(image), mask=np.array(mask))
            image = Image.fromarray(augmented['image'])
            mask = Image.fromarray(augmented['mask'])
         

    image = TF.to_tensor(image).float()
    image = image/(image.max()+0.00001)
    mask = binarize(TF.to_tensor(mask)).float()
    return image, mask

Also, as you advised, I tried learning with a small sample.

After training, I found that the train loss is still constant even on a small sample.

However, when training without augmentation, I confirmed that learning proceeded normally.

I wonder why learning fails when augmentation is applied.
When augmentation is applied, could it be that too little is learned in each epoch?

Any help would be appreciated. Thanks!

ptrblck · May 7, 2020, 7:30am

Where do these transformations come from?
My best guess is that these transformations (especially the blur) might be too aggressive.
Could you lower the values a bit and check if the training benefits from it?

skyunyoo · May 11, 2020, 1:49am

Sorry for the late reply, and thank you.
As you suggested, I applied only the blur and checked it, and I got bad results.

However, after several trials, it seems that augmentation does not play a decisive role in the constant train loss.

I reconsidered your previous answer and went through the data again from the beginning, and I found something odd in the normalization part.

In my code,
image = Image.fromarray(image)
image = TF.to_tensor(image).float()
image = image/(image.max()+0.000000001)
To fit the data into the [0,1] range, I divided each image by its own .max() value.

My question here is that the max value differs from image to image. Is it correct to scale to [0,1] using each image's own max rather than the max value of the entire dataset?

Also, I have seen in various posts that the data should be normalized to the [-1,1] range.
However, when I use norm = transforms.Normalize([0.5], [0.5]) followed by image = norm(image), the mean and std of the resulting images are not 0 and 1, respectively. Is it correct to apply this?

I’m always grateful for your help.
Thanks!

ptrblck · May 11, 2020, 1:55am

It might be OK, if you apply the same preprocessing on the test set. However, you wouldn’t be able to use Normalize with the mean and std of the training set afterwards.

It comes down to your use case and what works better. The “standard” approach would be to standardize the data, i.e. such that it has a zero mean and unit variance.
However, you could also try to normalize the data to [-1, 1] and compare the results.
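
The two options can be compared side by side. The sketch below uses made-up data (random values in [0, 1]) and plain tensor arithmetic; the second option is the same computation that transforms.Normalize([0.5], [0.5]) performs, namely (x - 0.5) / 0.5, which maps [0, 1] to [-1, 1] but yields zero mean and unit std only if the data happens to have mean 0.5 and std 0.5.

```python
import torch

torch.manual_seed(0)

# Hypothetical training images already scaled to [0, 1]; shape [N, 1, H, W].
train_images = torch.rand(100, 1, 32, 32)

# Option 1: standardize with statistics computed once over the training set.
mean = train_images.mean()
std = train_images.std()
standardized = (train_images - mean) / std

# Option 2: map [0, 1] to [-1, 1], i.e. what Normalize([0.5], [0.5]) computes.
scaled = (train_images - 0.5) / 0.5

print('standardized: mean {:.4f}, std {:.4f}'.format(
    standardized.mean().item(), standardized.std().item()))
print('scaled: min {:.4f}, max {:.4f}, std {:.4f}'.format(
    scaled.min().item(), scaled.max().item(), scaled.std().item()))
```

Whichever option is chosen, the same mean and std (or the same fixed constants) would then be reused for the validation and test data.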

skyunyoo · May 12, 2020, 7:03am

Thank you for the reply.
Based on your suggestions, I tried all three: the [0,1] range, the [-1,1] range, and standardization to mean 0 and std 1.
However, none of them worked properly, and while inspecting the inputs I found data with a max value of 0.

These patches with a max value of 0 seem to come from splitting a single image into patches and dividing each patch by its own max value.

Does input data with a max value of 0 interfere with learning?
If I want to normalize the data to the [0,1] range when splitting an image into patches for training, is it correct to divide by the max value of the original image?
Or should I treat each patch as one image and divide it by the max value of that patch?

Thanks!

ptrblck · May 12, 2020, 7:50am

If you’ve created the patch with a max value of 0 by dividing by the max value of all patches (let’s call it patches_max), this would mean that patches_max would have to be extremely large.
Are you sure the zero value was created in this way?

Usually you wouldn’t normalize each instance with its min and max values, but would use the statistics from the training set. However, as I’m not familiar with your use case, I would still recommend to try out different methods.
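
A small sketch of the difference, using three made-up patches (the intensity values are assumptions, not taken from the actual MRI data): a patch that is all zeros stays all zeros under either scheme, so the division itself does not create it, while per-patch scaling stretches a dim patch to the same range as a bright one.

```python
import torch

eps = 1e-5

# Hypothetical patches: bright tissue, dim tissue, and pure background.
patches = torch.stack([
    torch.full((1, 32, 32), 4000.0),
    torch.full((1, 32, 32), 1000.0),
    torch.zeros(1, 32, 32),
])

# Per-patch scaling, as in the posted code: each patch divided by its own max.
per_patch_max = patches.flatten(1).max(dim=1).values.view(-1, 1, 1, 1)
per_patch = patches / (per_patch_max + eps)

# Scaling by a single statistic taken from the whole training set.
train_max = patches.max()
per_dataset = patches / (train_max + eps)

for i in range(len(patches)):
    print('patch {}: per-patch max {:.2f}, per-dataset max {:.2f}'.format(
        i, per_patch[i].max().item(), per_dataset[i].max().item()))
```

With per-patch scaling the bright and dim patches become indistinguishable, whereas the dataset-level statistic keeps their relative intensities.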

If none is working, I would suggest to look into other parts of your training routine, which might be failing.
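
One such part that may be worth checking, offered as a hedged observation rather than a confirmed diagnosis: in the posted train_fit, exp_lr_scheduler.step() sits inside the loop over the 10 files, so the scheduler advances 10 times per epoch instead of once. The sketch below reproduces only that stepping pattern with StepLR(step_size=7, gamma=0.1) and lr=0.00001; the nn.Linear model is a placeholder and no real training happens.

```python
import torch
from torch import nn, optim
from torch.optim import lr_scheduler

# Placeholder model; only the optimizer/scheduler bookkeeping matters here.
model = nn.Linear(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.00001)
scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

lrs = []
for epoch in range(3):
    for file_idx in range(10):
        optimizer.step()   # stand-in for training on one numpy file
        scheduler.step()   # called once per file, as in the posted train_fit
    lrs.append(optimizer.param_groups[0]['lr'])
    print('epoch {}: lr = {:.1e}'.format(epoch, lrs[-1]))
```

After 30 scheduler steps the learning rate has been multiplied by 0.1 four times, so it is about 1e-9 by the end of the third epoch. Moving the scheduler step outside the file loop, so that it runs once per epoch, would be one way to test whether this contributes to the constant loss.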