Manually accumulate the loss over a number of images, then backpropagate the average loss and update the network weights

I am working on a task where the batch size is 1, i.e., each batch contains only one image. So I have to do manual batching: when the number of accumulated losses reaches a certain count, average the loss and then do the backpropagation.
My original code is:

real_batchsize = 200

for epoch in range(1, 5):
    net.train()

    total_loss = Variable(torch.zeros(1).cuda(), requires_grad=True)

    iter_count = 0
    for batch_idx, (input, target) in enumerate(train_loader):

        input, target = Variable(input.cuda()), Variable(target.cuda())
        output = net(input)

        loss = F.nll_loss(output, target)

        total_loss = total_loss + loss  # accumulate the per-image loss

        # every real_batchsize images: average the accumulated loss, backprop, and update
        if batch_idx % real_batchsize == 0:
            iter_count += 1

            ave_loss = total_loss/real_batchsize
            ave_loss.backward()
            optimizer.step()

            if iter_count % 10 == 0:
                print("Epoch:{}, iteration:{}, loss:{}".format(epoch,
                                                           iter_count,
                                                           ave_loss.data[0]))
            total_loss.data.zero_() 
            optimizer.zero_grad()

This code gives the following error message:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
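
(For reference, the same RuntimeError can be reproduced with a minimal standalone snippet, unrelated to my network, that simply calls backward() twice on the same graph:)

import torch
from torch.autograd import Variable

x = Variable(torch.ones(1), requires_grad=True)
y = x * x     # the multiplication saves its inputs for the backward pass
y.backward()  # the first backward frees the graph's intermediate buffers
y.backward()  # the second backward raises the same "backward through the graph a second time" error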

I have tried the following ways.

first way (failed)

I read some posts about this error message, but I could not fully understand them. Changing ave_loss.backward() to ave_loss.backward(retain_graph=True) prevents the error message, but the loss doesn't improve and soon becomes nan.
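
Concretely, the only change in the loop above was this line (sketch):

ave_loss = total_loss / real_batchsize
ave_loss.backward(retain_graph=True)  # keeps the graph alive across iterations, so no RuntimeError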

second way (failed)

I also tried changing the accumulation to total_loss = total_loss + loss.data[0]. This also prevents the error message, but the loss always stays the same, so there must be something wrong.
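
That is, the accumulation line becomes (sketch):

total_loss = total_loss + loss.data[0]  # accumulate the plain number from loss.data instead of the Variable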

third way (success)

Following the instructions in this post, I divide each image's loss by real_batchsize and backprop it immediately. When the number of input images reaches real_batchsize, I do one parameter update with optimizer.step(). The loss slowly decreases as training goes on, but the training speed is really slow, because we backprop for every single image. A rough sketch of this version is shown below.
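
A rough sketch of this third way, reusing net, optimizer, and train_loader from the code above:

real_batchsize = 200

for epoch in range(1, 5):
    net.train()
    optimizer.zero_grad()

    # start batch_idx at 1 so the update fires after real_batchsize images
    for batch_idx, (input, target) in enumerate(train_loader, 1):
        input, target = Variable(input.cuda()), Variable(target.cuda())
        output = net(input)

        # scale each per-image loss and backprop it immediately;
        # gradients accumulate in the parameters' .grad buffers
        loss = F.nll_loss(output, target) / real_batchsize
        loss.backward()

        if batch_idx % real_batchsize == 0:
            optimizer.step()       # one weight update per real_batchsize images
            optimizer.zero_grad()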

my question

What does the error message mean in my case? Also, why don't the first and second ways work? How should I write the code so that we can backprop the gradient every real_batchsize images and update the weights only once, so that training is faster? I know my code is nearly correct, but I just do not know how to change it.