Why does the loss jump up at the beginning of each epoch?

(Update: I found the mistake in the code; see the end of the post.)

Assume we are training a convolutional network, a residual network, or something similar on an image classification problem. Following the usual procedure, for each cycle (epoch) we iterate over the training set in batches and update the parameters after every batch. I see (for a large enough batch size) that once a cycle ends and a new one starts, the loss jumps up and then gradually decreases again. In fact, even though the test accuracy keeps improving, the loss reported for the first batches of a cycle is always higher than that of the last batches of the previous cycle. Here is an example:

Files already downloaded and verified
Files already downloaded and verified
Number of training examples: 50000
Number of test examples: 10000
Cycle:  1
	Batch 39 	Cost 1.9820496678352355
	...
	Batch 359 	Cost 1.0509697571396828
Started evaluating test accuracy...
	Test accuracy:  0.6138
Cycle:  2
	Batch 39 	Cost 1.7036092668771743
	...
	Batch 359 	Cost 0.7662393197417259
Started evaluating test accuracy...
	Test accuracy:  0.718
Cycle:  3
	Batch 39 	Cost 1.235978502035141
	...
	Batch 359 	Cost 0.6208606369793415
Started evaluating test accuracy...
	Test accuracy:  0.7179
Cycle:  4
	Batch 39 	Cost 1.0319311194121839
	...
	Batch 359 	Cost 0.5332359492778778
Started evaluating test accuracy...
	Test accuracy:  0.7554
Cycle:  5
	Batch 39 	Cost 0.8292788065969944
	...
	Batch 359 	Cost 0.482302039116621
Started evaluating test accuracy...
	Test accuracy:  0.7629
Cycle:  6
	Batch 39 	Cost 0.7198149688541889
	...
	Batch 359 	Cost 0.429959923774004
Started evaluating test accuracy...
	Test accuracy:  0.7904
Cycle:  7
	Batch 39 	Cost 0.668946772068739
	...
	Batch 359 	Cost 0.40095994248986244

I will post the code at the end of the message, but I am trying to wrap my head around this behavior. At the beginning of each cycle the parameters are updated to decrease the cost of the first batch, but after that further updates are made to decrease the cost with respect to the following batches. This could in principle increase the cost of the first batch again if the data is not shuffled well enough. However, I load my data using

training_loader_CIFAR10 = torch.utils.data.DataLoader(dataset=training_set_CIFAR10,
                                                      batch_size=128, shuffle=True)

So this behavior is rather unexpected to me. It is not alarming, since on average the loss keeps going down, but I would like to understand conceptually why the cost jumps up when a new cycle starts. Note: my batch size is 128, so the code below prints the cost averaged over blocks of 40 batches, i.e. 128*40 = 5120 training examples.
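Regarding the shuffling: as far as I understand, a DataLoader with shuffle=True draws a fresh permutation of the training set at the start of every epoch, which is easy to check. A small sanity-check sketch (it reuses the loader defined above; the variable names here are just for illustration):

#each call to iter() starts a new pass and hence a new random order,
#so the labels of the two "first" batches should almost certainly differ
labels_epoch_1 = next(iter(training_loader_CIFAR10))[1]
labels_epoch_2 = next(iter(training_loader_CIFAR10))[1]
print(torch.equal(labels_epoch_1, labels_epoch_2))  #expected: False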

The code for the training part is below. The network structure, which I don't post, is just a custom-made residual network consisting of basic blocks only.
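(For reference, training_set_CIFAR10 above is the standard torchvision CIFAR-10 dataset, which is where the "Files already downloaded and verified" lines come from. A rough sketch of the data setup, where the transform and the test-set names are placeholders rather than my exact code:)

import torch
import torchvision
import torchvision.transforms as transforms

#placeholder transform: just convert the images to tensors
transform = transforms.ToTensor()

#50000 training images and 10000 test images
training_set_CIFAR10 = torchvision.datasets.CIFAR10(root='./data', train=True,
                                                    download=True, transform=transform)
test_set_CIFAR10 = torchvision.datasets.CIFAR10(root='./data', train=False,
                                                download=True, transform=transform)

training_loader_CIFAR10 = torch.utils.data.DataLoader(dataset=training_set_CIFAR10,
                                                      batch_size=128, shuffle=True)
test_loader_CIFAR10 = torch.utils.data.DataLoader(dataset=test_set_CIFAR10,
                                                  batch_size=128, shuffle=False)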

#rnn is the defined network. cost_criterion is the cost function.
#optimizer is the chosen gradient descent method, lr is the learning rate,
#gamma is the learning rate decay factor and schedule is the list of cycles
#at which the learning rate should decay.

def train(cycles, cost_criterion, rnn, optimizer, lr, gamma, schedule):

    average_cost = 0 #running cost accumulated over the training batches
    acc = 0 #accuracy over the test set
    
    for e in range(cycles): #cycle through the database many times

        print('Cycle: ',e+1)
   
        #at any cycle given by the schedule list, we decrease the learning rate by
        #lr -> lr*gamma
        if e in schedule:
            lr *= gamma
            print('Changing learning rate. New learning rate is')
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr
                print('%.5f' % param_group['lr'])

        #put rnn in train mode
        rnn.train()
         
        #the following for loop cycles over the training set in batches
        #of batch_size=128 using the training_loader object
        for i, (x, y) in enumerate(training_loader_CIFAR10, 0):
        
            #x, y hold one batch of inputs and labels, moved to the GPU
            x, y = Variable(x).cuda(), Variable(y).cuda()
            

            h = rnn.forward(x) #calculate hypothesis over the batch
            
            cost = cost_criterion(h, y) #calculate the cost of the results over the batch
            
            optimizer.zero_grad() #set the gradients to 0
            cost.backward() # calculate derivatives wrt parameters
            optimizer.step() #update parameters

            average_cost = average_cost + cost.data[0] #accumulate the running cost
            
            
            if i % 40 == 39: #print the average cost every 40 batches
                print('\tBatch', i, '\tCost', average_cost/40)
                average_cost = 0
        
        acc = test() #once a pass over the whole training set is complete,
                     #evaluate the test accuracy before starting the next cycle
        #scheduler.step(acc)  
        print('\tTest accuracy: ', acc) 

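(As an aside, the manual learning-rate decay above could equivalently be expressed with PyTorch's built-in MultiStepLR scheduler; a rough sketch, where the milestone values are only examples and not the schedule I actually use:)

from torch.optim.lr_scheduler import MultiStepLR

#assuming 'optimizer' and 'cycles' are defined as in train() above;
#'milestones' plays the same role as my 'schedule' list
scheduler = MultiStepLR(optimizer, milestones=[10, 20], gamma=0.1)

for e in range(cycles):
    #...one pass over the training set, as in train() above...
    scheduler.step() #decay the learning rate at the scheduled cycles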
Update: I found it as I was going over my message. The condition (i % 40 == 39) is never satisfied by the final batch of an epoch: with 50000 examples and a batch size of 128 there are 391 batches (indices 0 to 390), the last printout happens at batch 359, and 390 % 40 != 39. So average_cost is not reset to 0 at the end of the epoch, the cost accumulated over the remaining 31 batches carries over into the first printout of the next cycle, and that first average looks larger than usual. Resetting average_cost to 0 at the end of each epoch resolves the issue.
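A minimal sketch of the fix, keeping the variable names and the (older) Variable/.data[0] style of my code; the only real change is resetting the running cost at the end of each epoch (plus a small counter so the printed averages stay exact even if a block is shorter than 40 batches):

        average_cost = 0
        batches_since_print = 0 #batches accumulated since the last printout

        for i, (x, y) in enumerate(training_loader_CIFAR10, 0):
            x, y = Variable(x).cuda(), Variable(y).cuda()

            h = rnn(x)                  #hypothesis over the batch
            cost = cost_criterion(h, y) #cost of this batch

            optimizer.zero_grad()
            cost.backward()
            optimizer.step()

            average_cost += cost.data[0]
            batches_since_print += 1

            if i % 40 == 39: #print the average cost every 40 batches
                print('\tBatch', i, '\tCost', average_cost / batches_since_print)
                average_cost = 0
                batches_since_print = 0

        #reset at the end of the epoch so the leftover batches (360-390 here)
        #do not leak into the first printout of the next cycle
        average_cost = 0
        batches_since_print = 0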