# Differences between gradient calculated by different reduction methods

I’m playing with the different reduction methods provided by the built-in loss functions. In particular, I would like to compare the following:

• The averaged gradient obtained by performing a backward pass on each per-sample loss value computed with `reduction="none"`
• The gradient from `reduction="sum"`, averaged by dividing by the batch size
• The average gradient yielded by `reduction="mean"`
• The average gradient computed with `reduction="mean"`, with the data points fed into the model one at a time
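
In principle all four should coincide: with `CrossEntropyLoss`, the mean-reduced loss is exactly the sum-reduced loss divided by the batch size, so the gradients should differ only by that factor. A minimal sanity check on a throwaway linear model (the model and data here are my own illustration, not the experiment below):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)
x = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))

# Gradient of the summed loss
model.zero_grad()
nn.CrossEntropyLoss(reduction="sum")(model(x), y).backward()
g_sum = model.weight.grad.clone()

# Gradient of the mean loss
model.zero_grad()
nn.CrossEntropyLoss(reduction="mean")(model(x), y).backward()
g_mean = model.weight.grad.clone()

# The two should agree up to the 1/n factor
print(torch.allclose(g_sum / len(x), g_mean, atol=1e-6))  # True
```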

My code for the experiment is as follows:

```python
import torch
import torch.nn as nn

def estimate_gradient(model, optimizer, batch):
    criterion_no_reduction = nn.CrossEntropyLoss(reduction="none").cuda()
    criterion_sum = nn.CrossEntropyLoss(reduction="sum").cuda()
    criterion_avg = nn.CrossEntropyLoss().cuda()

    input, target = batch
    input, target = input.cuda(), target.cuda()
    output = model(input)
    n = len(output)

    # (1) reduction="none": backward on each per-sample loss, average the grads
    loss_no_reduction = criterion_no_reduction(output, target)
    grads_none = []
    for i in range(n):
        optimizer.zero_grad()  # clear accumulated grads before each backward
        loss_no_reduction[i].backward(retain_graph=True)
        for j, param in enumerate(model.parameters()):
            if i == 0:
                grads_none.append(param.grad.clone() / n)
            else:
                grads_none[j] += param.grad.clone() / n

    # (2) reduction="sum": single backward, then divide by the batch size
    optimizer.zero_grad()
    loss_sum = criterion_sum(output, target)
    loss_sum.backward(retain_graph=True)
    grads_sum = []
    for j, param in enumerate(model.parameters()):
        if j == 0:
            grads_sum = [param.grad.clone() / n]
        else:
            grads_sum.append(param.grad.clone() / n)

    # (3) reduction="mean": single backward, gradients used as-is
    optimizer.zero_grad()
    loss_avg = criterion_avg(output, target)
    loss_avg.backward(retain_graph=True)
    grads_avg = []
    for j, param in enumerate(model.parameters()):
        if j == 0:
            grads_avg = [param.grad.clone()]
        else:
            grads_avg.append(param.grad.clone())

    # (4) reduction="mean", one sample at a time: average the grads
    target = target.view(-1, 1)
    grads_single = []
    for i in range(n):
        optimizer.zero_grad()
        curr_output = output[i].view(1, -1)
        loss = criterion_avg(curr_output, target[i])
        loss.backward(retain_graph=True)
        for j, param in enumerate(model.parameters()):
            if i == 0:
                grads_single.append(param.grad.clone() / n)
            else:
                grads_single[j] += param.grad.clone() / n

    return grads_none, grads_sum, grads_avg, grads_single
```

Running this, I get:

```
Maximum discrepancy between reduction = none and sum: 0.0316
```
That is, the gradients produced by `reduction="none"` and by the one-by-one backward pass appear to be identical, while `reduction="sum"` and `reduction="mean"` yield results that differ from that pair. An explanation of the discrepancy (maybe it is due to `retain_graph=True`?) would be really helpful, and thanks in advance for any help!
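
For reference, my understanding is that `retain_graph=True` only keeps the autograd graph alive so `backward()` can be called again; it does not change the gradient values themselves. A more common source of this kind of mismatch is that `.grad` buffers accumulate across `backward()` calls unless they are zeroed in between. A minimal sketch of that effect, on a hypothetical toy model rather than the one above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)
x = torch.randn(2, 4)
y = torch.randint(0, 3, (2,))

# Two backward calls WITHOUT zeroing in between: the grads add up
loss = nn.CrossEntropyLoss(reduction="none")(model(x), y)
loss[0].backward(retain_graph=True)
loss[1].backward()
accumulated = model.weight.grad.clone()

# Same two backward calls with zeroing in between, summed manually
model.zero_grad()
loss = nn.CrossEntropyLoss(reduction="none")(model(x), y)
loss[0].backward(retain_graph=True)
g0 = model.weight.grad.clone()
model.zero_grad()
loss[1].backward()
g1 = model.weight.grad.clone()

# The un-zeroed run equals the sum of the two individual gradients
print(torch.allclose(accumulated, g0 + g1, atol=1e-6))  # True
```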