Hello everyone,
I have been using PyTorch for the last couple of days. Twice now I have noticed an unusual loss pattern while training DNNs in a for loop. I am presenting one of those cases here.
I am trying to train an ensemble of 5 DNNs here.
import torch
from torch.utils.data import DataLoader

# model, optimizer, lossfun, cv, writer, ensemble and NOMAdata are defined elsewhere (not shown)
for e in range(ensemble):
    training = 'training_set_' + str(e) + '.npy'
    labels = 'labels_' + str(e) + '.npy'
    nomadata = NOMAdata(training, labels)
    # resetting the parameters of the model's direct children
    for layer in model.children():
        if hasattr(layer, 'reset_parameters'):
            layer.reset_parameters()
    # model = DeepNOMA(structure)
    for fold, (train_idx, test_idx) in enumerate(cv.split(nomadata)):
        # creating the samplers
        train_sampler = torch.utils.data.SubsetRandomSampler(train_idx)
        test_sampler = torch.utils.data.SubsetRandomSampler(test_idx)
        trainloader = DataLoader(nomadata, batch_size=250, sampler=train_sampler)
        testloader = DataLoader(nomadata, batch_size=250, sampler=test_sampler)
        # resetting the parameters
        # model = DeepNOMA(structure)
        running_loss = 0
        if fold == 0:
            for epoch in range(5):
                for i, val in enumerate(trainloader):
                    inputs, targets = val
                    # clear the gradients
                    optimizer.zero_grad()
                    # model output
                    yhat = model(inputs)
                    # calculate loss
                    loss = lossfun(yhat, targets)
                    # backprop
                    loss.backward()
                    # update model parameters
                    optimizer.step()
                    # logging training performance every 20 batches
                    running_loss += loss.item()
                    if i % 20 == 19:
                        # calculating validation loss (no gradients needed)
                        model.eval()
                        validation_loss = 0
                        with torch.no_grad():
                            for j, batch in enumerate(testloader):
                                test, labels = batch
                                ypred = model(test)
                                runval = lossfun(ypred, labels)
                                validation_loss += runval.item()
                        writer.add_scalars('Training/Validation Loss', {'Training loss': running_loss/20, 'Validation Loss': validation_loss/j}, epoch*len(trainloader) + i)
                        model.train()
                        print(running_loss/20, validation_loss/j, fold, e)
                        running_loss = 0
    # Saving the model's state_dict
    save_path = 'model' + str(e) + '.pth'
    torch.save(model.state_dict(), save_path)
As you can see, each DNN is trained on a different dataset. Below are the loss values, printed in (training loss, validation loss, fold, ensemble index) format.
0.6160063549876214 0.5588574174678687 0 0
0.35422628968954084 0.34084207271084643 0 0
0.25784978866577146 0.2524514478264433 0 0
0.22559681087732314 0.22244494445998259 0 0
.
.
.
.
0.015405059373006225 0.005930508576295894 0 0
0.016571250976994634 0.005989199118557001 0 0
0.02103429418057203 0.0058264776904399344 0 1
0.02045736024156213 0.0058095609872705406 0 1
0.021323334984481336 0.0058715936229235 0 1
0.020745716243982314 0.005786245124358119 0 1
As you can see, although the loop has moved on to training a new model (the ensemble index changed from 0 to 1), the loss is surprisingly small right from the first printed step (compare it with the values at the top of the log). It seems the parameter reset is not working at all.
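A possible cause worth ruling out: `model.children()` only yields the direct children of the model. If `DeepNOMA` keeps its layers inside a container such as `nn.Sequential` (an assumption, since its definition is not shown here), the container itself has no `reset_parameters` method, so the `hasattr` check fails and nothing is reset. A minimal sketch with a hypothetical `ToyNet` stand-in, comparing the loop above with one that walks `model.modules()`:

```python
import torch
import torch.nn as nn


class ToyNet(nn.Module):
    """Hypothetical stand-in for DeepNOMA: layers nested inside nn.Sequential."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

    def forward(self, x):
        return self.net(x)


def reset_children_only(model):
    """Same logic as the loop in the post: only looks at direct children."""
    count = 0
    for layer in model.children():
        if hasattr(layer, 'reset_parameters'):
            layer.reset_parameters()
            count += 1
    return count


def reset_all(model):
    """Reset every submodule that defines reset_parameters, however deeply nested."""
    count = 0
    for module in model.modules():
        if hasattr(module, 'reset_parameters'):
            module.reset_parameters()
            count += 1
    return count


model = ToyNet()
before = model.net[0].weight.detach().clone()

# The only direct child is the nn.Sequential container, so nothing is reset.
n_shallow = reset_children_only(model)
unchanged = torch.equal(before, model.net[0].weight)

# modules() recurses, so both nn.Linear layers are re-initialized.
n_deep = reset_all(model)
changed = not torch.equal(before, model.net[0].weight)

print(n_shallow, unchanged, n_deep, changed)
```

If the real model is flat (all layers are direct attributes), the original loop does reach them, and the carried-over optimizer state would be the next thing to check.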
These experiences suggest that there is a better practice for training models inside a for loop that I am not aware of. Can anyone elaborate on this issue? Thanks in advance!
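For reference, one pattern commonly used for this kind of loop is to build a fresh model and a fresh optimizer inside the ensemble loop, so that neither the weights nor the optimizer state (for example Adam's moment estimates) carry over between members. A minimal sketch, where `make_model`, the layer sizes, the Adam optimizer, and the random data are all placeholders and not taken from my actual code:

```python
import torch
import torch.nn as nn


def make_model():
    # Hypothetical factory standing in for DeepNOMA(structure)
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))


def train_member(seed, steps=200):
    torch.manual_seed(seed)
    # Random placeholder data; the real code would load training_set_<e>.npy
    x = torch.randn(64, 4)
    y = x.sum(dim=1, keepdim=True)

    model = make_model()                                        # fresh weights
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)   # fresh optimizer state
    lossfun = nn.MSELoss()

    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = lossfun(model(x), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return model, losses


history = []
for e in range(3):
    model, losses = train_member(seed=e)
    history.append(losses)
    # first and last training loss of this ensemble member
    print(e, losses[0], losses[-1])
```

With this structure every member starts from newly initialized weights, so the first printed loss of each member should look like the start of a new run rather than a continuation of the previous one.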