(Update: I found the mistake in the code.)
Assume we are training a convolutional network, a residual network, or something similar on an image classification problem. Following the usual procedure, for each pass over the dataset (epoch, which I call a cycle below) we compute the gradient in batches and update the parameters after each batch. I see (for a large enough batch size) that once a cycle ends and a new one starts, the loss function jumps up in value and then gradually decreases again. In fact, even though the test accuracy keeps improving, the loss printed for the first batches of a cycle always jumps up compared to the last batches of the previous cycle. Here is an example:
Files already downloaded and verified
Files already downloaded and verified
Number of training examples: 50000
Number of test examples: 10000
Cycle: 1
Batch 39 Cost 1.9820496678352355
...
Batch 359 Cost 1.0509697571396828
Started evaluating test accuracy...
Test accuracy: 0.6138
Cycle: 2
Batch 39 Cost 1.7036092668771743
...
Batch 359 Cost 0.7662393197417259
Started evaluating test accuracy...
Test accuracy: 0.718
Cycle: 3
Batch 39 Cost 1.235978502035141
...
Batch 359 Cost 0.6208606369793415
Started evaluating test accuracy...
Test accuracy: 0.7179
Cycle: 4
Batch 39 Cost 1.0319311194121839
...
Batch 359 Cost 0.5332359492778778
Started evaluating test accuracy...
Test accuracy: 0.7554
Cycle: 5
Batch 39 Cost 0.8292788065969944
...
Batch 359 Cost 0.482302039116621
Started evaluating test accuracy...
Test accuracy: 0.7629
Cycle: 6
Batch 39 Cost 0.7198149688541889
...
Batch 359 Cost 0.429959923774004
Started evaluating test accuracy...
Test accuracy: 0.7904
Cycle: 7
Batch 39 Cost 0.668946772068739
...
Batch 359 Cost 0.40095994248986244
I will post the code at the end of the message, but I am trying to wrap my head around this kind of behavior. At the beginning of each cycle the parameters are updated to decrease the cost over the first batch, but after that further updates are made to decrease the cost with respect to the following batches. This could potentially increase the cost on the first batch again if the data is not shuffled well enough. However, I load my data using
training_loader_CIFAR10 = torch.utils.data.DataLoader(dataset=training_set_CIFAR10,
                                                      batch_size=128, shuffle=True)
So this is rather unexpected behavior for me. It is not so alarming, since on average the loss does seem to go down, but I would like to understand conceptually the reason for the jump in the cost function when a new cycle starts. Note: my batch size is 128, so the code below prints the average loss over every 40 batches, i.e. over 128*40 = 5120 training examples.
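To make the batch arithmetic concrete, here is a small standalone sketch (not part of my training script), assuming the DataLoader's default drop_last=False so the final, partial batch is kept:

```python
import math

num_examples = 50000  # size of the CIFAR-10 training set
batch_size = 128
print_every = 40

# with drop_last=False the final, partial batch is kept
batches_per_cycle = math.ceil(num_examples / batch_size)
print(batches_per_cycle)  # 391

# prints happen at batch indices 39, 79, ..., once per print_every batches
num_prints = batches_per_cycle // print_every
print(num_prints)  # 9, the last one at batch index 359

# batches after the last print whose cost is never reported in that cycle
leftover = batches_per_cycle - num_prints * print_every
print(leftover)  # 31
```

So each cycle ends with 31 batches whose costs have been accumulated but not yet printed.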
The code for the training part is below. The network structure, which I don't post, is just a custom-made residual network consisting of basic blocks only.
#rnn is the defined network, cost_criterion is the cost function,
#optimizer is the gradient descent method we chose, lr is the learning rate,
#gamma is the learning rate decay factor, and schedule is the list of cycles
#at which the learning rate should decay.
def train(cycles, cost_criterion, rnn, optimizer, lr, gamma, schedule):
    average_cost = 0  # running cost over the training batches
    acc = 0           # accuracy over the test set
    for e in range(cycles):  # cycle through the dataset many times
        print('Cycle: ', e + 1)
        # at the cycles given by the schedule list, decrease the
        # learning rate: lr -> lr*gamma
        if e in schedule:
            lr *= gamma
            print('Changing learning rate. New learning rate is')
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr
                print('%.5f' % param_group['lr'])
        # put rnn in train mode
        rnn.train()
        # the following loop cycles over the training set in batches
        # of batch_size=128 using the training_loader object
        for i, (x, y) in enumerate(training_loader_CIFAR10, 0):
            # x, y hold one batch of data from the training set
            x, y = Variable(x).cuda(), Variable(y).cuda()
            h = rnn.forward(x)           # compute the hypothesis on the batch
            cost = cost_criterion(h, y)  # compute the cost of the predictions
            optimizer.zero_grad()        # set the gradients to 0
            cost.backward()              # compute derivatives wrt the parameters
            optimizer.step()             # update the parameters
            average_cost = average_cost + cost.data[0]  # accumulate the cost
            if i % 40 == 39:  # print the average cost every 40 batches
                print('\tBatch', i, '\tCost', average_cost / 40)
                average_cost = 0
        acc = test()  # once a pass over the whole dataset is complete,
                      # look at the test accuracy before the next cycle
        # scheduler.step(acc)
        print('\tTest accuracy: ', acc)
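To isolate the bookkeeping from the training itself, here is a minimal standalone sketch of the average_cost accumulator logic across a cycle boundary (the per-batch costs are hypothetical constants, chosen only to make the arithmetic obvious). Because average_cost is only reset inside the print branch, the costs of the batches after the last print of one cycle carry over into the first printed average of the next:

```python
def printed_averages(costs_per_cycle, print_every=40):
    """Mirror the accumulator logic from train() on lists of
    per-batch costs (one list per cycle); return the printed averages."""
    average_cost = 0.0
    printed = []
    for costs in costs_per_cycle:          # one inner list per cycle
        for i, cost in enumerate(costs):   # mirrors the batch loop
            average_cost += cost
            if i % print_every == print_every - 1:
                printed.append(average_cost / print_every)
                average_cost = 0.0
        # note: no reset here, so leftover batches carry into the next cycle
    return printed

# 391 batches per cycle, constant cost 1.0 for clarity
two_cycles = [[1.0] * 391, [1.0] * 391]
avgs = printed_averages(two_cycles)
print(avgs[8])  # 1.0   -> last print of cycle 1 (batch 359)
print(avgs[9])  # 1.775 -> first print of cycle 2: (31 + 40) / 40
```

Even with a perfectly flat cost of 1.0 per batch, the first print of the second cycle is inflated, because the 31 unreported costs from the end of cycle 1 are divided by 40 together with the first 40 costs of cycle 2.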