The labels in the dataset change during training

Ahmed_Rashad · April 25, 2022, 11:23pm

Hi,
I am working on a hyperspectral image super-resolution problem. The inputs to the model are RGB images, with the corresponding hyperspectral images (HSI) as labels. For training, the whole dataset is normalized to [0, 1]. The issue is that some of the HSI label values change during training and fall outside the normalized range for no apparent reason. I tried different learning rates, without any luck. I am also using torch.autograd.set_detect_anomaly(True) to detect anomalies, but that did not help either. Any help would be appreciated.
Thanks in advance

InnovArul · April 26, 2022, 12:27am

I am curious to know the link between the learning rate and the unnormalized HSI label values. Is there an explicit dependency between them?
Also, without a reproducible code snippet, it would be tough to spot this kind of error or possible bug in your code (i.e., labels no longer normalized for no apparent reason).

Ahmed_Rashad · April 26, 2022, 6:16am

Thanks for the reply,
I believe there is no relation between the learning rate and the loss of normalization, but I am trying every possible solution I can. The training code is as follows:

import datetime
import random
import time

# AverageMeter and logger2 are defined elsewhere in the project.
def train(model, criterion, optimizer, train_loader, lr_scheduler, epoch, opt):
    total_loss = AverageMeter()
    losses = AverageMeter()
    losses_rgb = AverageMeter()
    # train_loader is a list of DataLoaders; visit them in random order
    random.shuffle(train_loader)
    prev_time = time.time()
    model.train()
    for train_loader_data in train_loader:
        for i, data in enumerate(train_loader_data):
            model.zero_grad()
            optimizer.zero_grad()
            images, labels = data
            images, labels = images.cuda(), labels.cuda()
            # Control statement: log and skip any batch whose labels fall outside [0, 1]
            if labels.min() < 0 or labels.max() > 1:
                print("yes, there is a problem in labels, with min: {0} and max: {1}"
                      .format(labels.min().item(), labels.max().item()))
                logger2.info("Epoch [%02d], min_value: %.9f, max_value: %.9f, batch no: %d/%d"
                             % (epoch, labels.min().item(), labels.max().item(),
                                i + 1, len(train_loader_data)))
            else:
                fake_hyper = model(images)
                # loss = criterion(fake_hyper, real_hyper)
                loss, loss_rgb = criterion(fake_hyper, labels, images)
                loss_all = loss + opt.trade_off * loss_rgb
                loss_all.backward()
                optimizer.step()
                # Step the scheduler after the optimizer, as PyTorch expects
                lr_scheduler.step()
                # Determine approximate time left
                iters_done = epoch * len(train_loader_data) + i
                iters_left = opt.epochs * len(train_loader_data) - iters_done
                time_left = datetime.timedelta(seconds=iters_left * (time.time() - prev_time))
                prev_time = time.time()
                # Record losses as plain Python floats
                losses.update(loss.item())
                losses_rgb.update(loss_rgb.item())
                total_loss.update(loss_all.item())
                print('[Epoch:%02d],[Batch no:%d/%d],[Time_left=%s],'
                      '[train_losses.avg=%.9f],[rgb_train_losses.avg=%.9f]'
                      % (epoch, i + 1, len(train_loader_data), time_left,
                         losses.avg, losses_rgb.avg))
    return total_loss.avg, losses.avg, losses_rgb.avg

and the code for the custom loss:

import numpy as np
import torch
import torch.nn as nn

class LossTrainCSS(nn.Module):
    def __init__(self):
        super(LossTrainCSS, self).__init__()
        # 1x1 convolution mapping the 31 hyperspectral bands to 3 RGB channels
        self.model_hs2rgb = nn.Conv2d(31, 3, 1, bias=False)
        filtersPath = './cie_1964_w_gain.npz'
        cie_matrix = np.load(filtersPath)['filters']
        cie_matrix = torch.from_numpy(
            np.transpose(cie_matrix, [1, 0])).unsqueeze(-1).unsqueeze(-1).float()
        self.model_hs2rgb.weight.data = cie_matrix

    def forward(self, outputs, label, rgb_label):
        rrmse = self.mrae_loss(outputs, label)
        # hs2rgb: computed under no_grad, so rrmse_rgb carries no gradient
        with torch.no_grad():
            rgb_tensor = self.model_hs2rgb(outputs)
            rgb_tensor = rgb_tensor / 255
            rgb_tensor = torch.clamp(rgb_tensor, 0, 1) * 255
            # Quantize to 8-bit values and convert back to float.
            # The original line was:
            # rgb_tensor = torch.tensor(rgb_tensor).byte().float()
            # and was updated following the warning from torch itself:
            rgb_tensor = rgb_tensor.clone().detach().byte().float()
            rgb_tensor = rgb_tensor / 255
        rrmse_rgb = self.rgb_mrae_loss(rgb_tensor, rgb_label)
        return rrmse, rrmse_rgb

    def mrae_loss(self, outputs, label):
        # Mean relative absolute error; yields inf/NaN wherever label == 0
        error = torch.abs(outputs - label) / label
        mrae = torch.mean(error.view(-1))
        return mrae

    def rgb_mrae_loss(self, outputs, label):
        # Mean absolute error between the projected RGB and the RGB input
        error = torch.abs(outputs - label)
        mrae = torch.mean(error.view(-1))
        return mrae

thanks in advance

InnovArul · April 26, 2022, 8:24am

Would you be able to print the range of values in your dataloader and in your training loop?
Also, is the data type of labels long?
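
A minimal sketch of the kind of report being asked for here: a small helper that prints the minimum, maximum, and dtype of a label batch, so the same call can be placed both where the dataset returns a sample and inside the training loop. The helper name `describe_labels`, the `tag` argument, and the random stand-in batch are assumptions for illustration; they do not come from the posted code.

```python
import torch

def describe_labels(labels: torch.Tensor, tag: str = "") -> dict:
    """Print and return the min, max and dtype of a label batch."""
    stats = {
        "min": labels.min().item(),
        "max": labels.max().item(),
        "dtype": labels.dtype,
    }
    print(f"{tag} min={stats['min']:.6f} max={stats['max']:.6f} dtype={stats['dtype']}")
    return stats

# Stand-in batch for illustration only (assumed shape: batch x 31 bands x H x W);
# real labels would come from the DataLoader.
example = torch.rand(2, 31, 8, 8, dtype=torch.float64)
stats = describe_labels(example, tag="[dataset]")
```

Calling it with different tags (for example "[dataset]" and "[train loop]") makes it easier to see at which stage the range first departs from [0, 1].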

Ahmed_Rashad · April 26, 2022, 11:14am

That’s why I put the control statement inside the training loop. The data values are all doubles in [0, 1.0] at the beginning of training. The dataset is prepared, cropped, and saved offline before training starts. After some epochs, the maximum exceeds 1.0, reaching 1.8444354, and in some cases it continues to increase much higher. The model has no contact with the labels except through the loss calculation. I don’t know the reason for this behavior.
Thanks in advance,
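
One way the claim that the loss is the only contact with the labels could be tested is to snapshot the labels before the loss call and compare afterwards. The sketch below is only an illustration: `mutates_labels` and `inplace_loss` are hypothetical names, `inplace_loss` is a deliberately buggy variant written for the demonstration, and nothing in the thread establishes that the posted loss behaves that way.

```python
import torch

def mrae_loss(outputs, label):
    # Same formula as the posted mrae_loss; uses out-of-place ops only
    return torch.mean(torch.abs(outputs - label) / label)

def inplace_loss(outputs, label):
    # Hypothetical buggy variant: add_() modifies the label tensor in place
    label.add_(1e-3)
    return torch.mean(torch.abs(outputs - label) / label)

def mutates_labels(loss_fn, outputs, labels):
    """Return True if calling loss_fn changes `labels` in place."""
    snapshot = labels.clone()
    loss_fn(outputs, labels)
    return not torch.equal(snapshot, labels)

# Synthetic tensors for illustration; labels are kept away from zero
outputs = torch.rand(2, 31, 4, 4)
labels = torch.rand(2, 31, 4, 4) + 0.1
safe_result = mutates_labels(mrae_loss, outputs, labels)
buggy_result = mutates_labels(inplace_loss, outputs, labels)
print(safe_result, buggy_result)
```

If the check reports no mutation for the real criterion, the loss can be ruled out and attention can move to the data pipeline.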

InnovArul · April 26, 2022, 11:24am

In that case, you could debug this without even using the model. There is no need to complicate things by involving the model, I guess.

Just run the for loop without model forward/backward code, check the sanity of labels, and locate the files which cause this issue.
Then take it from there to find out what caused this issue.
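
The suggestion above can be sketched roughly as follows, assuming (as in the posted training code) that the loaders are kept in a list and each batch is an (images, labels) pair. The function name `find_bad_label_batches`, its parameters, and the synthetic data are assumptions made for the example. Several passes are run because the problem was reported to appear only after some epochs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def find_bad_label_batches(loaders, epochs=1, low=0.0, high=1.0):
    """Iterate over all loaders with no model involved and collect
    (epoch, loader_idx, batch_idx, min, max) for out-of-range label batches."""
    bad = []
    for epoch in range(epochs):
        for loader_idx, loader in enumerate(loaders):
            for batch_idx, (_, labels) in enumerate(loader):
                lo, hi = labels.min().item(), labels.max().item()
                if lo < low or hi > high:
                    bad.append((epoch, loader_idx, batch_idx, lo, hi))
    return bad

# Synthetic stand-in data: the second dataset has one out-of-range label value
images = torch.rand(4, 3, 8, 8)
good = torch.rand(4, 31, 8, 8)
corrupt = good.clone()
corrupt[2, 0, 0, 0] = 1.8  # sample 2 lands in batch 1 when batch_size=2

loaders = [
    DataLoader(TensorDataset(images, good), batch_size=2, shuffle=False),
    DataLoader(TensorDataset(images, corrupt), batch_size=2, shuffle=False),
]
bad = find_bad_label_batches(loaders, epochs=2)
for entry in bad:
    print(entry)
```

With shuffling disabled, the batch index maps directly back to sample indices, and from there to the files on disk. If this loop never reports a bad batch while the full training loop does, the cause is more likely in the training step than in the dataset.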

Ahmed_Rashad · April 26, 2022, 11:40am

Great suggestion. I will start by doing that and isolate the problem step by step. I will keep you posted. Thanks for the suggestion.