Using different learning rates for different samples

I want to use different learning rates for different input samples when training my model. Any good suggestions? :thinking:

The approach I'm thinking of is:

Assume we use nn.BCELoss() as the loss function.
As stated here,

'none' : no reduction will be applied

So I can get the loss value of each sample by:

    criterion_reduction_none = nn.BCELoss(reduction='none')
    loss_reduction_none = criterion_reduction_none(output, target)

Then I can adjust the loss value of each sample like:

    loss_reduction_none[0] = loss_reduction_none[0] * scaling_factor_0
    loss_reduction_none[1] = loss_reduction_none[1] * scaling_factor_1
    ...
    loss_reduction_none[n] = loss_reduction_none[n] * scaling_factor_n
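
Put together, something like this (the model, data, and scaling factors below are placeholders I made up just to illustrate the idea):

    import torch
    import torch.nn as nn

    # Illustrative placeholders: a tiny model and a batch of 4 samples.
    model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid())
    criterion_reduction_none = nn.BCELoss(reduction='none')

    inputs = torch.randn(4, 10)
    targets = torch.randint(0, 2, (4, 1)).float()
    scaling_factors = torch.tensor([[1.0], [0.5], [2.0], [1.0]])  # one made-up factor per sample

    outputs = model(inputs)
    loss_reduction_none = criterion_reduction_none(outputs, targets)  # shape (4, 1): one loss per sample
    loss_reduction_none = loss_reduction_none * scaling_factors       # scale each sample's loss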

This works because the learning rate enters the update rule as:

w_new = w_old - learning_rate * (∂ loss / ∂ weight)

As mentioned above, by adjusting the loss value of each sample,
I indirectly solve the problem of adjusting the learning rate for each sample.
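
A quick sanity check of this reasoning: gradients are linear in the loss, so scaling a sample's loss by k scales its gradient by k, which has the same effect as scaling its learning rate by k.

    import torch

    w = torch.tensor(2.0, requires_grad=True)
    x = torch.tensor(3.0)

    loss = (w * x) ** 2
    loss.backward()
    print(w.grad)                    # tensor(36.), since d/dw (w*x)^2 = 2*w*x^2

    w.grad = None
    (0.5 * (w * x) ** 2).backward()  # same loss, scaled by 0.5
    print(w.grad)                    # tensor(18.): the gradient is halved too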

Then I can call the backward and step functions:

    loss_reduction_none.backward()
    optimizer.step()

But I am not sure whether PyTorch will use the loss values I have modified.
Actually, I don't know what kind of use case the reduction parameter is designed for.

Is this the right approach? :thinking:

Maybe this question is similar to the one below:


I second @shirui-japina’s suggestions. Do a nn.LossFn(reduction='none'), weigh the different samples to your liking and then do your own reduction (e.g. mean).
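
For example (the outputs, targets, and weights below are just stand-ins):

    import torch
    import torch.nn as nn

    criterion = nn.BCELoss(reduction='none')

    logits = torch.randn(5, 1, requires_grad=True)  # stand-in for raw model outputs
    outputs = torch.sigmoid(logits)                 # BCELoss expects probabilities in (0, 1)
    targets = torch.randint(0, 2, (5, 1)).float()
    weights = torch.tensor([[1.0], [1.0], [3.0], [0.5], [1.0]])  # made-up per-sample weights

    loss_per_sample = criterion(outputs, targets)   # shape (5, 1): one loss per sample
    loss = (loss_per_sample * weights).mean()       # weigh, then do your own reduction
    loss.backward()                                 # works because loss is now a scalar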


Thanks for your reply.

Do a nn.LossFn(reduction='none') , weigh the different samples to your liking

If the method above is correct (in fact, I think it is correct),
why do I need to do a reduction (e.g. mean)?
What will happen if I don't?

From mathematical theory:

  • Batch Gradient Descent
    Uses all samples for each update; when the number of samples is large, training is slow.

  • Stochastic Gradient Descent
    Updates on a single sample at a time; decreased accuracy, not guaranteed to be globally optimal.

  • Mini-batch Gradient Descent
    Uses a subset of samples for each parameter update.

Usually we use a mini-batch-gradient-descent-like method to update the parameters, which means we use the average value of the loss function over the batch to update them.
Is the process of averaging (doing the reduction, e.g. mean) equivalent to this process?
Or is the averaging (reduction, e.g. mean) just part of PyTorch's operating mechanism?

Yup :slight_smile: We have a loss for each sample in your batch, then take the mean to get one scalar loss which we can use to backpropagate. (edit: it doesn't need to be the mean; other reduction techniques such as sum also work)
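
You can check that the built-in reductions are exactly the mean/sum of the per-sample losses:

    import torch
    import torch.nn as nn

    outputs = torch.sigmoid(torch.randn(4, 1))
    targets = torch.randint(0, 2, (4, 1)).float()

    loss_none = nn.BCELoss(reduction='none')(outputs, targets)

    print(torch.allclose(loss_none.mean(), nn.BCELoss(reduction='mean')(outputs, targets)))  # True
    print(torch.allclose(loss_none.sum(),  nn.BCELoss(reduction='sum')(outputs, targets)))   # True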


Thanks for your reply. :smiley:
I think I have thoroughly understood this problem. :smiley::smiley:


I have reconsidered it and found some problems.

The topic above describes the operating mechanism of .backward() and optimizer.step() in PyTorch.
A simple summary:

  • .backward()
    Computes the gradients for the parameters in the model.

  • optimizer.step()
    Performs a parameter update based on the current gradient (stored in the .grad attribute of a parameter) and the update rule.
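
For example, a toy illustration of these two steps:

    import torch

    w = torch.tensor(1.0, requires_grad=True)
    optimizer = torch.optim.SGD([w], lr=0.1)

    loss = (w - 3.0) ** 2
    loss.backward()
    print(w.grad)      # tensor(-4.), i.e. 2 * (w - 3), stored in w.grad by .backward()

    optimizer.step()   # w <- w - lr * w.grad = 1.0 - 0.1 * (-4.0)
    print(w)           # tensor(1.4000, requires_grad=True)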

I mean, optimizer.step() takes care of the update method and actually updates the parameters; it is not up to PyTorch users to implement the update method. (So we shouldn't have to do the reduction (e.g. mean) or anything like that manually.)

What we should do is:

  1. Before calling .backward(), adjust the loss value of each sample.

  2. Then get the gradients for the parameters in the model with .backward().

  3. At the end, update the parameters based on the current gradients with optimizer.step().

The reduction parameter in LossFn() can be set to: 'none' | 'mean' | 'sum'.
But it is not there to average the loss values for the parameter update.
(Although I don't know what 'mean' | 'sum' are for.)

If you set the reduction flag to 'none' and try to call backward() on the un-reduced loss, you get this error message:

RuntimeError: grad can be implicitly created only for scalar outputs
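
A minimal way to reproduce it, and the fix (reduce to a scalar first):

    import torch
    import torch.nn as nn

    outputs = torch.sigmoid(torch.randn(4, 1, requires_grad=True))
    targets = torch.randint(0, 2, (4, 1)).float()

    loss = nn.BCELoss(reduction='none')(outputs, targets)  # shape (4, 1), not a scalar

    # loss.backward()       # -> RuntimeError: grad can be implicitly created only for scalar outputs
    loss.mean().backward()  # reduce to a scalar first, then backward works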


:scream::scream::scream: