# How to decrease the weight of a mini-batch?

The problem comes from this paper: Decoupling “when to update” from “how to update”.
Setting aside the details of the paper, implementing it results in mini-batches of different sizes.
For example, if we set the mini-batch size to 128, some of the batches could have only 10 or even zero samples.
If there are zero samples in a batch, then we can just skip the update at that iteration.
However, if a batch is much smaller, let's say 10% of the 128, then we have to scale down the gradient update on that iteration.
The idea is that a less important batch shouldn't have the same impact as a full-size batch.
If we reduce the learning rate 10-fold on this particular batch, we can achieve this effect, and use the normal learning rate on the full-size batches.
But how can we achieve this in PyTorch?
I hope my description is clear. Thank you so much for any suggestions.

You can reduce the output of the loss function yourself, or use `reduction='sum'` instead of `reduction='mean'`. This way your overall loss is proportional to the number of samples in each batch.
CrossEntropyLoss:

• reduction (string, optional) – Specifies the reduction to apply to the output: `'none'` | `'mean'` | `'sum'`. `'none'`: no reduction will be applied, `'mean'`: the weighted mean of the output is taken, `'sum'`: the output will be summed. Note: `size_average` and `reduce` are in the process of being deprecated, and in the meantime, specifying either of those two args will override `reduction`. Default: `'mean'`
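To see why `reduction='sum'` gives a small batch a proportionally smaller gradient, here is a minimal sketch (the logits, targets, and class count are made up for illustration). With `'sum'`, the loss of an N-sample batch equals N times its `'mean'` loss, so a 13-sample batch contributes roughly a tenth of what a 128-sample batch does:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical 4-class problem with a nominal batch of 128.
logits = torch.randn(128, 4)
targets = torch.randint(0, 4, (128,))

sum_loss = nn.CrossEntropyLoss(reduction='sum')
mean_loss = nn.CrossEntropyLoss(reduction='mean')

full = sum_loss(logits, targets)              # grows with 128 samples
small = sum_loss(logits[:13], targets[:13])   # grows with only 13 samples

# With 'mean', both batches would have comparable magnitude instead.
print(full.item(), small.item())
print(mean_loss(logits, targets).item(),
      mean_loss(logits[:13], targets[:13]).item())
```

The key identity is `sum_loss(batch) == mean_loss(batch) * batch_size`, which holds exactly for `nn.CrossEntropyLoss`.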

That's a very interesting solution to the problem. I had only seen the `weight` parameter in the loss function, but I don't think it fits this situation. If it works, it will scale down nicely. I'll give it a try and let you know the result. Thank you.

When `nn.CrossEntropyLoss(reduction='sum')` is used, the model simply does not converge with `torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)`.
Once I switch back to `mean`, the same model converges within 5 epochs, tested on MNIST.
It seems the `sum` option is mostly used in evaluation for computing the overall loss, and rarely during training.

@sunfishcc it's because of the large learning rate.
When you're using `mean`, you're effectively dividing the learning rate by the batch size.
Use a smaller learning rate.
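This is easy to check directly: with the same weights and data, the gradient under `'sum'` is exactly batch-size times the gradient under `'mean'`, so a learning rate tuned for `'mean'` is effectively 128x too large with `'sum'` on 128-sample batches. A quick sketch (toy linear model and random data, just for the comparison):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(128, 10)
y = torch.randint(0, 4, (128,))

model = nn.Linear(10, 4)
grads = {}
for reduction in ('mean', 'sum'):
    model.zero_grad()
    loss = nn.CrossEntropyLoss(reduction=reduction)(model(x), y)
    loss.backward()
    grads[reduction] = model.weight.grad.clone()

# The 'sum' gradient is exactly 128x the 'mean' gradient.
print((grads['sum'] / grads['mean']).mean().item())
```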

@mMagmer I see. It works once I changed `lr` to `0.0001`.

I also tried manually adjusting the lr when the batch size is low: `optimizer.param_groups[0]['lr'] *= 0.1`, then resetting it after calling `step()`. This feels like a hack, and I'm not sure whether it has any side effects.
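An alternative to editing `param_groups` is to keep `reduction='mean'` and scale the loss itself by the batch's fraction of the nominal size, which down-weights small batches without touching optimizer state. A sketch, assuming a nominal batch size of 128 as in the question (model, data, and `NOMINAL` are illustrative):

```python
import torch
import torch.nn as nn

NOMINAL = 128  # full mini-batch size assumed in this thread

model = nn.Linear(10, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss(reduction='mean')

def train_step(x, y):
    if x.size(0) == 0:        # empty batch: skip the update entirely
        return None
    optimizer.zero_grad()
    # Scale the mean loss by the batch fraction, so a 13-sample
    # batch contributes ~10% of a full batch's gradient.
    loss = criterion(model(x), y) * (x.size(0) / NOMINAL)
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical small batch: 13 samples out of a nominal 128.
x = torch.randn(13, 10)
y = torch.randint(0, 4, (13,))
train_step(x, y)
```

One subtle difference from the temporary-lr hack: with SGD momentum, temporarily shrinking `lr` scales the whole update, including the accumulated momentum buffer, while scaling the loss only shrinks the new gradient that enters the buffer. The loss-scaling version is arguably closer to the stated goal of down-weighting the batch's contribution.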

Thank you so much for your suggestion.