Retrain a pretrained model from scratch without restarting the Colab kernel

Hi everyone,

I am running a set of experiments where I vary network hyperparameters and check the effect on the model's accuracy.
For this I have a list of hyperparameter value sets to compare.
I want to train the model in a loop in Google Colab, starting from scratch in each run, and then log the results of each run using MLflow.
Is there a way I can train from scratch in a loop without having to restart the Jupyter kernel every time?

Help will be really appreciated!

Thank you!!

There are a few valid approaches.
You could:

  • just recreate the model and optimizer
  • call reset_parameters() on each layer of your model
  • or store the initial state_dict of the randomly initialized model and load it for each reset.

Let me know if one of these approaches would work for you.
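For example, a minimal sketch of the third option (assuming a hypothetical MyModel class and a plain SGD optimizer) could look like this:

import copy
import torch.optim as optim

model = MyModel()  # hypothetical model class, freshly (randomly) initialized
init_state = copy.deepcopy(model.state_dict())  # snapshot of the initial weights

for lr in (0.1, 0.01, 0.001):
    model.load_state_dict(init_state)                 # reset the weights
    optimizer = optim.SGD(model.parameters(), lr=lr)  # fresh optimizer state
    # ... run the training loop for this configuration and log the results ...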

Hi @ptrblck, thanks for your suggestion and apologies for the delayed reply. I instead created a network class that takes the network hyperparameters (number of output channels, filter size, etc.) as input parameters, so I don't need to reset the network each time.
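Roughly like this (only a sketch; the single-channel 28x28 input, ReLU activations, and the absence of pooling are placeholders, not my exact model):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    # all architecture hyperparameters are constructor arguments, so a
    # brand-new, freshly initialized model can be built for every run
    def __init__(self, conv1_kernel_size, conv1_output_channels,
                 conv2_kernel_size, conv2_output_channels, fc1_output_size,
                 num_classes=10, in_channels=1, input_size=28):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, conv1_output_channels, conv1_kernel_size)
        self.conv2 = nn.Conv2d(conv1_output_channels, conv2_output_channels, conv2_kernel_size)
        # spatial size after two valid convolutions without pooling
        feat = input_size - (conv1_kernel_size - 1) - (conv2_kernel_size - 1)
        self.fc1 = nn.Linear(conv2_output_channels * feat * feat, fc1_output_size)
        self.fc2 = nn.Linear(fc1_output_size, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)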

@ptrblck in another scenario, I am trying to test one network with varying batch size, learning rate, momentum and, most importantly, the network initialization method. I am trying it out like this:

best_net = Net(conv1_kernel_size=3, conv1_output_channels=24, conv2_kernel_size=5, conv2_output_channels=24, fc1_output_size=200)
for bs in batch_size:
    trainloader = torch.utils.data.DataLoader(trainset, batch_size = bs, shuffle = True, num_workers = 2)
    testloader = torch.utils.data.DataLoader(testset, batch_size = bs, shuffle = True, num_workers = 2)
    for init in inits:
        apply_init(best_net, init)
        init_dict = best_net.state_dict()
        for lr in lrs:
            best_net.load_state_dict(init_dict)
            # optimizer.reset()
            optimizer = optim.SGD(best_net.parameters(), lr = lr, momentum = momentum)

where apply_init will just call the following function:

def init_weight_xu(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1 or classname.find('Linear') != -1:
        torch.nn.init.xavier_uniform_(m.weight) 
        m.bias.data.fill_(0.1)

The first run went well, starting at 84% test accuracy and reaching 91% in 15 epochs. But the 2nd run, with a new lr, started at 17%. I suspect that something is not getting reset properly, either the network or the optimizer.
Can you please help me with this?

Does your model only contain conv and linear layers? I.e. no batchnorm layers or others?
If you are using batchnorm layers, their running estimates as well as affine parameters won’t be reset.
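In that case you could reset them explicitly, e.g. with a small helper like this (just a sketch, assuming standard nn.BatchNorm layers):

import torch.nn as nn

def reset_bn(m):
    if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        m.reset_running_stats()   # running_mean / running_var back to their defaults
        m.reset_parameters()      # re-initializes the affine weight and bias

model.apply(reset_bn)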

Also, use copy.deepcopy(best_net.state_dict()), as otherwise init_dict will just hold a reference to the live state_dict.
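I.e. roughly:

import copy

# take an independent snapshot instead of keeping a live reference
init_dict = copy.deepcopy(best_net.state_dict())

# later, this now really restores the initial weights
best_net.load_state_dict(init_dict)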

My model has only linear and conv layers; there are no batch norm layers.
Also, I think that particular lr/bs configuration was the reason for the low accuracy, as I observed that in the next run, with a different lr value, the network again showed the expected performance.

Is that expected to happen?

Yes, a proper learning rate is essential to train a deep learning model, and the batch size is also a hyperparameter that influences the training.

Noted that…thanks!!

I tried xavier_normal initialization with a set of parameters (bs, lr, etc.) as mentioned above and it worked well. Now I want to try normal initialization with the same set of parameters.
I am trying it out as:

best_net = Net(conv1_kernel_size=3, conv1_output_channels=24, conv2_kernel_size=5, conv2_output_channels=24, fc1_output_size=200)
best_net.apply(init_weight_normal)

where init_weight_normal is a function

def init_weight_normal(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1 or classname.find('Linear') != -1:
        torch.nn.init.normal_(m.weight)
        m.bias.data.fill_(0.1)         

And in the main loop for each iteration, I am calling best_net.apply(init_weight_normal)
i.e. just resetting the network to the normal initialization, which worked well in the case of xavier normal.
But now I am observing that, for every set of parameters, the loss stays at 2.303 and the test accuracy is fixed at 10% for all epochs, and in some runs the loss abruptly jumps at some epoch to a huge value like 9999543434.0 and so on.
I am not able to deduce what is going wrong in the case of normal initialization.

Especially if your model is deep, i.e. it contains a lot of layers, the initialization is an important step to let the model train. It was one of the key ingredients to be able to train deep neural networks at all, so I'm not surprised if a lot of initialization methods fail.
If your loss is increasing, your model might be diverging. Sometimes this is due to the missed step of zeroing out the gradients, but I assume you haven’t changed the training loop.

@ptrblck sir, my model is not deep. It has just 2 conv layers and 2 linear layers. The loss is not increasing; it is steady at 2.303, and suddenly, for just one epoch, it goes to an unjustifiably large value and then comes back to 2.303 in the next epoch. The model's test accuracy, though, remains fixed at 10%.
Speaking of zeroing the gradients, I am doing that for each training iteration.
Attaching the training loop structure for your reference:

#training with model hyperparams
init = "normal"
for bs in batch_size:
    trainloader = torch.utils.data.DataLoader(trainset, batch_size = bs, shuffle = True, num_workers = 2)
    testloader = torch.utils.data.DataLoader(testset, batch_size = bs, shuffle = True, num_workers = 2)
        
    # init_dict = best_net.state_dict()
    for lr in lrs:
        for momentum in momentums:
            best_net.apply(init_weight_normal)
            # should I try this method instead?
            # best_net = Net(conv1_kernel_size=3, conv1_output_channels=24, conv2_kernel_size=5, conv2_output_channels=24, fc1_output_size=200)
            # best_net.load_state_dict(init_dict)

            optimizer = optim.SGD(best_net.parameters(), lr = lr, momentum = momentum)

            print("Currently processing init="+init+" bs="+str(bs)+" lr="+str(lr)+" mom="+str(momentum))
            for epoch in range(num_epochs):
                # Was trying this, but observed that zero grad is present in the train function itself
                #optimizer.zero_grad()
                #best_net.zero_grad()
                print('Epoch ', epoch+1, ' LR ', lr)
                rloss, t_accuracy = train(epoch, trainloader, optimizer, criterion, best_net)
                acc_score = test(testloader, best_net)
                loss_list.append(rloss)
                acc_list.append(acc_score)

The train function -

def train(epoch, trainloader, optimizer, criterion, net):
    running_loss = 0.0
    total = 0
    correct = 0
    for i, data in enumerate(tqdm(trainloader), 0):
        inputs, labels = data
        if torch.cuda.is_available():
            inputs, labels = inputs.cuda(), labels.cuda()
        optimizer.zero_grad()
        outputs = net(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        loss = criterion(outputs, labels)
        
        #backward pass
        loss.backward()
        
        #weight update
        optimizer.step()
        running_loss += loss.item()
    print('epoch %d training loss: %.3f' %(epoch + 1, running_loss/(len(trainloader))))
    return [running_loss/(len(trainloader)), 100*correct/total]

And the network architecture summary -

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 16, 24, 24]             416
            Conv2d-2           [-1, 16, 22, 22]           2,320
            Linear-3                 [-1, 1000]       7,745,000
            Linear-4                   [-1, 10]          10,010
================================================================

Also, I just observed that for some configurations the training loss shows up as 'nan'.
I must be doing something wrong with respect to the initialization. Doesn't normal initialization work well? Or am I initializing it the wrong way?
If not, what other initialization methods should I try apart from xavier-normal?

Can someone explain this weird behaviour in my training?

I am now trying out kaiming_normal initialization instead of normal.
The training loop is as shown in the post above this.
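I plan to apply it with a function mirroring init_weight_normal above (the nonlinearity='relu' argument is only an assumption and should match the activation actually used in the model):

def init_weight_kaiming_normal(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1 or classname.find('Linear') != -1:
        # 'relu' assumed here; match it to the model's activation function
        torch.nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        m.bias.data.fill_(0.1)

best_net.apply(init_weight_kaiming_normal)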

@ptrblck