Hi,
I am working on a hyperspectral image (HSI) super-resolution problem. The model's inputs are RGB images, and the corresponding hyperspectral images serve as labels. For training, the whole dataset is normalized to [0, 1]. The issue is that the values of some HSI labels change during training and are no longer normalized, for no apparent reason. I tried different learning rates, without any luck. I am also using torch.autograd.set_detect_anomaly(True) to detect anomalies, but that did not help either. Any help would be appreciated.
Thanks in advance
I am curious about the link between the learning rate and the unnormalized HSI label values. Is there an explicit dependency between them?
Also, without a reproducible code snippet, it would be tough to spot this kind of error or possible bug in your code (i.e., labels that are no longer normalized for no apparent reason).
Thanks for the reply,
I believe there is no relation between the learning rate and the loss of normalization, but I am trying every possible solution I can. The training code is as follows:
def train(model, criterion, optimizer, train_loader, lr_scheduler, epoch, opt):
    total_loss = AverageMeter()
    losses = AverageMeter()
    losses_rgb = AverageMeter()
    random.shuffle(train_loader)
    prev_time = time.time()
    model.train()
    for _, train_loader_data in enumerate(train_loader):
        for i, data in enumerate(train_loader_data):
            model.zero_grad()
            optimizer.zero_grad()
            images, labels = data
            images, labels = images.cuda(), labels.cuda()
            ## the control statement to only predict the un normalized input error
            if labels.min() < 0 or labels.max() > 1:
                print("yes ther3 is problem in labels ",
                      "with min: {0} and max: {1}".format(labels.min(), labels.max()))
                logger2.info("Epoch [%02d], min_value:%.9f, max_value : %.9f,batch no: %d/%d"
                             % (epoch, labels.min().detach().cpu(),
                                labels.max().detach().cpu(), i + 1, len(train_loader)))
            else:
                lr_scheduler.step()
                fake_hyper = model.forward(images)
                #loss = criterion(fake_hyper, real_hyper)
                loss, loss_rgb = criterion(fake_hyper, labels, images)
                loss_all = loss + opt.trade_off * loss_rgb
                loss_all.backward()
                optimizer.step()
                # Determine approximate time left
                iters_done = epoch * len(train_loader_data) + i
                iters_left = opt.epochs * len(train_loader_data) - iters_done
                time_left = datetime.timedelta(seconds=iters_left * (time.time() - prev_time))
                prev_time = time.time()
                # record loss
                losses.update(loss.data)
                losses_rgb.update(loss_rgb.data)
                total_loss.update(loss_all.data)
                print('[Epoch:%02d],[Batch no:%d/%d],[Time_left=%s],'
                      '[train_losses.avg=%.9f],'
                      '[rgb_train_losses.avg=%.9f]'
                      % (epoch, i + 1, len(train_loader_data), time_left, losses.avg,
                         losses_rgb.avg))
    return total_loss.avg, losses.avg, losses_rgb.avg
and the code for the custom loss:
class LossTrainCSS(nn.Module):
    def __init__(self):
        super(LossTrainCSS, self).__init__()
        self.model_hs2rgb = nn.Conv2d(31, 3, 1, bias=False)
        filtersPath = './cie_1964_w_gain.npz'
        cie_matrix = np.load(filtersPath)['filters']
        cie_matrix = torch.from_numpy(
            np.transpose(cie_matrix, [1, 0])).unsqueeze(-1).unsqueeze(-1).float()
        self.model_hs2rgb.weight.data = cie_matrix

    def forward(self, outputs, label, rgb_label):
        rrmse = self.mrae_loss(outputs, label)
        # hs2rgb
        with torch.no_grad():
            rgb_tensor = self.model_hs2rgb(outputs)
            rgb_tensor = rgb_tensor / 255
            rgb_tensor = torch.clamp(rgb_tensor, 0, 1) * 255
            # rgb_tensor = torch.tensor(rgb_tensor, dtype=torch.uint8)
            # rgb_tensor = torch.tensor(rgb_tensor, dtype=torch.uint8)
            # update from torch it self is the line below , the original line is below
            # the written one
            rgb_tensor = rgb_tensor.clone().detach().byte().float()
            #rgb_tensor = torch.tensor(rgb_tensor).byte().float()
            rgb_tensor = rgb_tensor / 255
        rrmse_rgb = self.rgb_mrae_loss(rgb_tensor, rgb_label)
        return rrmse, rrmse_rgb

    def mrae_loss(self, outputs, label):
        error = torch.abs(outputs - label) / label
        mrae = torch.mean(error.view(-1))
        return mrae

    def rgb_mrae_loss(self, outputs, label):
        error = torch.abs(outputs - label)
        mrae = torch.mean(error.view(-1))
        return mrae
thanks in advance
Would you be able to print the range of values in your dataloader and in your training loop?
Also, is the data type of labels long?
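For example, a small helper along these lines could report the dtype and value range of any batch. This is only a sketch: the name describe is made up for illustration, and it assumes the object exposes .dtype, .min() and .max(), which both torch tensors and NumPy arrays do.

```python
def describe(name, tensor):
    """Return a one-line summary of a tensor's dtype and value range.

    Hypothetical helper: relies only on .dtype, .min() and .max(),
    so it accepts torch tensors as well as NumPy arrays.
    """
    lo = float(tensor.min())
    hi = float(tensor.max())
    return "{}: dtype={}, min={:.6f}, max={:.6f}".format(name, tensor.dtype, lo, hi)
```

Calling print(describe("labels", labels)) once right after the batch comes out of the dataloader and once inside the training loop would show whether the two ranges agree.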
That's why I put the control statement inside the training loop. The data values are all doubles in [0, 1.0] at the beginning of training. The dataset is prepared, cropped, and saved offline before training starts. After some epochs, the maximum exceeds 1.0, reaching 1.8444354, and in some cases it keeps increasing much higher. The model has no contact with the labels except in the loss calculation. I don't know the reason for this behavior.
Thanks in advance,
In that case, you could debug this without even using the model. There is no need to complicate things by involving the model, I guess.
Just run the for loop without model forward/backward code, check the sanity of labels, and locate the files which cause this issue.
Then take it from there to find out what caused this issue.
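A minimal sketch of such a model-free check is below. The function name find_bad_label_batches is hypothetical, and it assumes train_loader is a list of loaders that each yield (images, labels) pairs, as in the training code posted above.

```python
def find_bad_label_batches(loaders, low=0.0, high=1.0):
    """Scan every batch without running the model.

    Returns a list of (loader_index, batch_index, min, max) tuples for
    batches whose labels fall outside [low, high].
    Assumes each loader yields (images, labels) pairs and that labels
    exposes .min() and .max() (torch tensor or NumPy array).
    """
    bad = []
    for loader_idx, loader in enumerate(loaders):
        for batch_idx, (images, labels) in enumerate(loader):
            lo, hi = float(labels.min()), float(labels.max())
            if lo < low or hi > high:
                bad.append((loader_idx, batch_idx, lo, hi))
    return bad
```

Running it for several passes, e.g. for epoch in range(5): print(epoch, find_bad_label_batches(train_loader)), may help narrow things down. If the first pass is clean and bad batches only show up in later passes, that would suggest the labels are being changed in memory somewhere in the data pipeline rather than being wrong on disk.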
Great suggestion. I will start by doing that and isolate the problem step by step. I will keep you posted. Thanks for the suggestion.