Shuffling input

Hi All

I am working with hyperspectral images as the input to my NN.
This is the shape of my dataset: (65000, 1, 92), where 65000 is the number of samples (signals), 1 is the channel, and 92 is the length of each signal.
When I trained my model, it turned out that the error on the test data is lower than the error on the training data.
I assume this has to do with how I shuffle my data, because the signals are distributed randomly across my pixels (every single pixel is a signal).
Below you can see my dataloader code. Does anyone have an idea whether it is okay?

```python
data_set = torch.utils.data.TensorDataset(new_input_tensor, new_target_tensor)
n_train = int(len(data_set) * 0.80)
n_test = len(data_set) - n_train
train_data_set, test_data_set = torch.utils.data.random_split(data_set, [n_train, n_test])
train_loader = torch.utils.data.DataLoader(train_data_set, batch_size=2621, shuffle=True, drop_last=True)
test_loader = torch.utils.data.DataLoader(test_data_set, batch_size=2621, shuffle=True, drop_last=True)
```

Thank you,

How large is the gap between the training and the test error?
If your model uses e.g. dropout layers, the loss might be (slightly) higher during training, as dropout reduces the capacity of the model.
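
As a quick illustration (a minimal sketch with a made-up model, not your architecture), dropout is only active in `train()` mode, so the same batch can give a higher loss during training than during evaluation:

```python
import torch
import torch.nn as nn

# Toy model with dropout; the layer sizes here are arbitrary.
model = nn.Sequential(nn.Linear(92, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 92))
criterion = nn.MSELoss()
x = torch.randn(8, 92)

model.train()                               # dropout is active
train_loss = criterion(model(x), x)

model.eval()                                # dropout is disabled
with torch.no_grad():
    eval_loss = criterion(model(x), x)

print(train_loss.item(), eval_loss.item())  # the train-mode loss is typically higher
```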

Hi Patrick

Thank you for your reply.
Train Epoch : 400/400 |Loss : 0.0128
Test Epoch : 400/400 |Loss : 0.0032

Do you think the way I split the dataset between the training and test sets is right?
Actually, dropout is off.

The usage of `torch.utils.data.random_split` should be correct.
How are you calculating the training loss? Are you printing the average over the complete training epoch, or is this the value of the last batch?
In the former case, are you also monitoring the training loss for each batch?
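
Just to make sure we mean the same thing, here is a sketch of the two options (placeholder names, not your code):

```python
# Sketch only: two common ways of reporting the "training loss" per epoch.
running_loss = 0.0
for batch_idx, (inp, _) in enumerate(train_loader):
    output = model(inp)
    loss = criterion(output, inp)
    # ... zero_grad, backward, optimizer step ...
    running_loss += loss.item()

epoch_avg_loss = running_loss / (batch_idx + 1)  # average over all batches of the epoch
last_batch_loss = loss.item()                    # loss of the last batch only
```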

This is what I used:
```python
# Training loop. A learning rate scheduler is used to find the best learning rate:
# every step_size epochs the learning rate is multiplied by gamma.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=320, gamma=0.7, last_epoch=-1)

epoch_num = 0
train_error = []
test_error = []
best_error = 100

for epoch in range(num_epochs):
    print(scheduler.get_last_lr())
    loss_total = 0
    test_loss_total = 0

    # Training phase
    model.train()
    for batch_idx, sample in enumerate(train_loader):
        inp, _ = sample                    # autoencoder: the input is also the target
        inp = inp.cuda()
        output = model(inp)
        loss = criterion(output, inp)
        loss_total += loss.item()          # accumulate the batch losses
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                       # step the scheduler once per epoch
    loss_total = loss_total / (batch_idx + 1)
    train_error.append(loss_total)

    # Evaluation phase
    model.eval()
    with torch.no_grad():
        for batch_idx, sample in enumerate(test_loader):
            inp, _ = sample
            inp = inp.cuda()
            output = model(inp)
            loss = criterion(output, inp)
            test_loss_total += loss.item()
        test_loss_total = test_loss_total / (batch_idx + 1)
        if test_loss_total < best_error:   # keep the checkpoint with the lowest test loss
            best_error = test_loss_total
            best_epoch = epoch
            print('Best loss at epoch', best_epoch)
            model_save_name = '2020-05-25 Li_sample_no_image_processing (5)'
            path = f"/content/drive/My Drive/{model_save_name}"
            torch.save(model.state_dict(), path)
        test_error.append(test_loss_total)

    if epoch % 10 == 9:
        epoch_num += 1
        print('\rTrain Epoch : {}/{} \tLoss : {:.4f}'.format(epoch + 1, num_epochs, loss_total))
        print('\rTest Epoch : {}/{} \tLoss : {:.4f}'.format(epoch + 1, num_epochs, test_loss_total))
```

I do print the average value over the mini-batches.

Thank you Patrick

You could try to add another loop after the training epoch that calculates the training loss on the trained model without any further training. This would give you the current training loss for this epoch without using the running average, which might create a bias.
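
Something along these lines should work (just a sketch reusing the names from your snippet, so double-check it against your code):

```python
# Sketch: recompute the training loss after the epoch has finished, with the
# model frozen in eval() mode, so it is directly comparable to the test loss
# (no running average over a model that is still changing).
model.eval()
train_loss_after_epoch = 0.0
with torch.no_grad():
    for batch_idx, sample in enumerate(train_loader):
        inp, _ = sample
        inp = inp.cuda()
        output = model(inp)
        loss = criterion(output, inp)
        train_loss_after_epoch += loss.item()
train_loss_after_epoch /= (batch_idx + 1)
print('Train loss recomputed after the epoch: {:.4f}'.format(train_loss_after_epoch))
```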

Let me know if this reduces the gap or not.

PS: you can post code snippets by wrapping them in three backticks ``` :wink:

Thank you, Patrick.
May I ask you to clarify this a bit more? Do you mean calculating the final error, i.e. the output of the NN? And how would that explain the problem I described, the error being smaller on the test set than on the training set?

Thank you

I think the gap might come from the training loss calculation, which uses the running average of the batch losses in the current epoch (while the model is being trained), while the validation loss is calculated after the current epoch has finished.
To exclude this possibility, you could also calculate the training loss after the epoch has finished (similar to how the validation loss is calculated) and check if the gap narrows.

Thank you, Patrick. I'll try that.