Gradient Calculation for part of the mini-batch

Hi, I am currently considering a special case of model training in which I use only part of the data to compute the loss, for example:

def train(epoch):
    total_loss = total_correct = 0
    for batch in train_loader:
        # Forward pass on the full batch, then keep only the first 100 outputs.
        out = model(batch.x)[:100]
        y = batch.y[:100].squeeze()
        loss = F.cross_entropy(out, y)

        total_loss += float(loss)
        total_correct += int(out.argmax(dim=-1).eq(y).sum())

I have two questions about the backward propagation:

(1) What are the gradients for those data points that do not participate in the loss computation? For instance, in the given example, what are the gradients for data points beyond the first 100? Are they zero?

(2) In this case, which part of the data participates in updating the model? Do the data points that are not part of the loss computation (say, those after the first 100 in the above example) contribute to the update of the model parameters?


  1. Regarding (2): if the model contains any operations, such as batch normalization, that operate on the entire batch of data, then data points beyond the first 100 can also contribute to the update of the parameters.

  2. Regarding (1): if the answer to (2) is yes, then the gradients for those data points may be non-zero; otherwise they are zero. Note that data points typically shouldn't require gradients at all; only parameters should have gradients.
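A quick way to check both points is to run a small experiment. The sketch below is only an illustration under assumed settings: the two toy models, the batch size of 128, the feature size of 8 and the helper name `probe` are all made up for the test and are not from the original code. It enables `requires_grad` on the input purely to inspect the input gradients, then shifts the unused rows to see whether the parameter gradient changes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
N_TOTAL, N_USED = 128, 100  # assumed batch size; only the first 100 rows enter the loss


def probe(model):
    """Return (max |grad| on the unused input rows, whether the first-layer
    weight gradient stays the same when the unused rows are shifted)."""
    x = torch.randn(N_TOTAL, 8, requires_grad=True)  # requires_grad only for inspection
    y = torch.randint(0, 3, (N_TOTAL,))

    def weight_grad(inp):
        model.zero_grad()
        # Forward on the full batch, loss on the first N_USED outputs only.
        loss = F.cross_entropy(model(inp)[:N_USED], y[:N_USED])
        loss.backward()
        return model[0].weight.grad.clone()

    g_ref = weight_grad(x)
    unused_input_grad = x.grad[N_USED:].abs().max().item()

    # Change only the rows that are sliced out of the loss.
    x_shift = x.detach().clone()
    x_shift[N_USED:] += 1.0
    g_shift = weight_grad(x_shift)
    return unused_input_grad, torch.allclose(g_ref, g_shift)


# Hypothetical toy models: one purely per-sample, one with batch normalization.
per_sample = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
with_bn = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 3))
with_bn.train()  # batch statistics are only used in training mode

plain_input_grad, plain_unchanged = probe(per_sample)
bn_input_grad, bn_unchanged = probe(with_bn)
print(plain_input_grad, plain_unchanged)
print(bn_input_grad, bn_unchanged)
```

In the per-sample model the unused rows should get zero gradient and leave the parameter gradient unchanged; with batch normalization in training mode both of those should stop being true, since the batch statistics mix all rows.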

Thanks for your reply! One follow-up question: in this scenario, is it equivalent to using two input tensors, x1 of length 100 with requires_grad = True and x2 of length (original batch size - 100) with requires_grad = False, and then concatenating them as the input? Thanks again!
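One way to probe this follow-up is to compare parameter gradients directly. As far as I understand autograd, `requires_grad` on an input tensor only controls whether a gradient is computed for that tensor; it does not stop the corresponding rows from contributing to the parameter gradients. The sketch below assumes a hypothetical per-sample toy model and a batch of 128, and uses `reduction="sum"` so the comparison is not affected by the different averaging.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical toy model and data, made up for this comparison.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
x1 = torch.randn(100, 8)
x2 = torch.randn(28, 8)
y = torch.randint(0, 3, (128,))


def weight_grad(x, n_used):
    """First-layer weight gradient when the loss covers the first n_used rows."""
    model.zero_grad()
    loss = F.cross_entropy(model(x)[:n_used], y[:n_used], reduction="sum")
    loss.backward()
    return model[0].weight.grad.clone()


# x1 requires grad, x2 does not, then concatenate (the setup from the question).
x_flagged = torch.cat([x1.clone().requires_grad_(True), x2])
# Same values, no requires_grad anywhere.
x_plain = torch.cat([x1, x2])

g_flagged = weight_grad(x_flagged, 128)  # loss over all rows, flagged inputs
g_plain = weight_grad(x_plain, 128)      # loss over all rows, plain inputs
g_sliced = weight_grad(x_plain, 100)     # loss over the first 100 rows only

flags_matter = not torch.allclose(g_flagged, g_plain)
same_as_slicing = torch.allclose(g_flagged, g_sliced)
print(flags_matter, same_as_slicing)
```

If both values come out `False`, the input flags make no difference to the parameter gradient, and the concatenation setup is not the same as slicing the loss: excluding rows from the update has to happen in the loss (or by detaching their outputs), not through `requires_grad` on the inputs.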