I see that in most PyTorch implementations, it is common to compute the output of a mini-batch by running it through the model and then computing and normalizing the loss. This implies that the loss on which backward() is called is the arithmetic mean of the per-sample losses produced by the mini-batch at the output layer.
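To make the pattern I mean concrete, here is a minimal sketch (toy model and data are hypothetical, just stand-ins for something like the ImageNet example): the criterion's default reduction averages the per-sample losses into a single scalar before backward() is called.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 3)            # toy model, stands in for a real network
criterion = nn.CrossEntropyLoss()   # default reduction averages over the batch

inputs = torch.randn(4, 10)         # mini-batch of 4 samples
targets = torch.tensor([0, 2, 1, 0])

outputs = model(inputs)
loss = criterion(outputs, targets)  # scalar: mean of the 4 per-sample losses
loss.backward()                     # gradients w.r.t. this averaged scalar
```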
My question is: does this averaging also take place at every layer? Is the loss at a given layer normalized, and that same normalized loss then used for all activations of that layer when computing the gradients for back-propagation? A reference to the modules in PyTorch that actually do this when backward() is called would be really helpful.
This applies to most implementations, but as an example you can check out the PyTorch ImageNet implementation. Upon digging further I found that we can pass reduce=False (reduction='none' in newer versions) to the loss function, which prevents the loss from being averaged.
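A minimal sketch of the contrast (toy logits, no model needed): the default reduction collapses the batch to one scalar, while reduction='none' (the current spelling of the older reduce=False) keeps the individual per-sample losses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)           # toy mini-batch: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 0])

# Default: per-sample losses are averaged into a single scalar.
mean_loss = nn.CrossEntropyLoss()(logits, targets)

# reduction='none' returns the individual per-sample losses instead.
per_sample = nn.CrossEntropyLoss(reduction='none')(logits, targets)

print(mean_loss.shape)   # a 0-dim scalar
print(per_sample.shape)  # one loss per sample
```

Averaging the per-sample vector recovers the default scalar, which is what suggests the mean is taken once at the output rather than layer by layer.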
However, the question remains whether the same averaging happens by default in the hidden layers during back-propagation.