Training an ensemble of DNNs in a for loop

Hello everyone,

I have been using PyTorch for the last couple of days. I have run into two situations where I noticed an unusual loss pattern while training DNNs in a for loop; I am presenting one of them here.

I am trying to train an ensemble of 5 DNNs here.

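For context, the objects used inside the loop (model, optimizer, lossfun, cv, writer and the NOMAdata dataset class) are created once before it; I have omitted those definitions. Roughly, the setup looks like the sketch below, where the DeepNOMA architecture, loss function and optimizer settings are simplified stand-ins rather than my actual definitions.

import numpy as np
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from torch.utils.tensorboard import SummaryWriter
from sklearn.model_selection import KFold

class NOMAdata(Dataset):
    # loads one pair of .npy files (features and labels)
    def __init__(self, feature_file, label_file):
        self.x = torch.from_numpy(np.load(feature_file)).float()
        self.y = torch.from_numpy(np.load(label_file)).float()

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

ensemble = 5
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))  # stand-in for DeepNOMA(structure)
optimizer = optim.Adam(model.parameters())  # created once, before the ensemble loop
lossfun = nn.BCEWithLogitsLoss()            # stand-in for my actual loss function
cv = KFold(n_splits=5, shuffle=True)
writer = SummaryWriter()

The training loop itself (this is the code I actually run):
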
for e in range(ensemble):
    training = 'training_set_' + str(e) + '.npy'
    labels = 'labels_' + str(e) + '.npy'
    nomadata = NOMAdata(training, labels)
    for layer in model.children():
        if hasattr(layer, 'reset_parameters'):
            layer.reset_parameters()
    # model = DeepNOMA(structure)


    for fold, (train_idx, test_idx) in enumerate(cv.split(nomadata)):

        # creating the samplers
        train_sampler = torch.utils.data.SubsetRandomSampler(train_idx)
        test_sampler = torch.utils.data.SubsetRandomSampler(test_idx)

        trainloader = DataLoader(nomadata, batch_size=250, sampler=train_sampler)
        testloader = DataLoader(nomadata, batch_size=250, sampler=test_sampler)

        # resetting the parameters
        # model = DeepNOMA(structure)

        running_loss = 0
        if fold == 0:
            for epoch in range(5):
                for i, val in enumerate(trainloader):
                    inputs, targets = val
                    # clear the gradients
                    optimizer.zero_grad()
                    # model output
                    yhat = model(inputs)
                    # calculate loss
                    loss = lossfun(yhat, targets)
                    # backprop
                    loss.backward()
                    # update model parameter
                    optimizer.step()
                    # logging training performance
                    running_loss += loss.item()

                    if i % 20 == 19:
                        # calculating validation loss
                        model.eval()
                        validation_loss = 0
                        for j, batch in enumerate(testloader):
                            test, labels = batch
                            ypred = model(test)
                            runval = lossfun(ypred, labels)
                            validation_loss += runval.item()

                        writer.add_scalars('Training/Validation Loss',
                                           {'Training loss': running_loss/20,
                                            'Validation Loss': validation_loss/j},
                                           epoch*len(trainloader) + i)

                        model.train()
                        print(running_loss/20, validation_loss/j, fold, e)
                        running_loss = 0

    # Saving the model parameters for this ensemble member
    save_path = 'model' + str(e) + '.pth'
    torch.save(model.state_dict(), save_path)

As you can see, each DNN is trained on a different dataset. Below are the printed loss values in (training loss, validation loss, fold, ensemble index) format.

0.6160063549876214 0.5588574174678687 0 0
0.35422628968954084 0.34084207271084643 0 0
0.25784978866577146 0.2524514478264433 0 0
0.22559681087732314 0.22244494445998259 0 0
.
.
.
.
0.015405059373006225 0.005930508576295894 0 0
0.016571250976994634 0.005989199118557001 0 0
0.02103429418057203 0.0058264776904399344 0 1
0.02045736024156213 0.0058095609872705406 0 1
0.021323334984481336 0.0058715936229235 0 1
0.020745716243982314 0.005786245124358119 0 1

As you can see, although the loop moves on to training a new model (the ensemble index changes from 0 to 1), the loss is surprisingly small compared with the values at the top of the snippet. It seems as if the parameter reset loop is not working at all.
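
To make that suspicion concrete, below is the kind of self-contained check I have in mind: snapshot the state_dict before running the same reset loop and compare afterwards. A small toy model stands in for DeepNOMA; I gave it one nested nn.Sequential block because model.children() only walks the top-level sub-modules.

import copy
import torch
import torch.nn as nn

# toy stand-in for DeepNOMA, with one nested block
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.Sequential(nn.Linear(32, 32), nn.ReLU()),
    nn.Linear(32, 2),
)

before = copy.deepcopy(model.state_dict())

# the same reset loop used in my training code
for layer in model.children():
    if hasattr(layer, 'reset_parameters'):
        layer.reset_parameters()

# report which tensors were actually re-initialised
for name, tensor in model.state_dict().items():
    changed = not torch.equal(before[name], tensor)
    print(name, 'changed' if changed else 'NOT changed')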

These experiences suggest that there must be a better practice for training models inside a for loop that I am not aware of. Can anyone elaborate on this issue? Thanks in advance!
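
P.S. The alternative I had in mind is the commented-out model = DeepNOMA(structure) line, i.e. re-creating the model (and presumably also the optimizer) at the start of each ensemble iteration, but I am not sure whether that is the recommended pattern. A rough sketch, again with a toy stand-in for DeepNOMA and the data pipeline omitted:

import torch
from torch import nn, optim

def make_model():
    # toy stand-in for DeepNOMA(structure)
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

ensemble = 5
for e in range(ensemble):
    model = make_model()                        # fresh, randomly initialised parameters
    optimizer = optim.Adam(model.parameters())  # new optimizer bound to the new parameters
    # ... build the dataset from training_set_<e>.npy / labels_<e>.npy
    #     and run the K-fold training exactly as above ...
    torch.save(model.state_dict(), 'model' + str(e) + '.pth')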