Different samples use different learning rates

shirui-japina · October 15, 2019, 2:01am

I want to use different learning rates for different input samples to carry out model training. Any good suggestions?

shirui-japina · October 15, 2019, 3:00am

The approach I have in mind is:

Assume nn.BCELoss() is the loss function.
As the documentation says:

'none' : no reduction will be applied

So I can get the loss value of each sample by (output and target stand for the model predictions and the labels):
criterion_reduction_none = nn.BCELoss(reduction='none')
loss_reduction_none = criterion_reduction_none(output, target)

Then I can adjust the loss value of each sample like:

loss_reduction_none[1] = loss_reduction_none[1] * scaling_factor_1
loss_reduction_none[2] = loss_reduction_none[2] * scaling_factor_2
...
loss_reduction_none[n] = loss_reduction_none[n] * scaling_factor_n

This works because the learning rate enters the update like this:

w_new = w_old - learning_rate * (∂ loss / ∂ weight)

As I mentioned above, I adjust the loss function value of each sample.
That indirectly solves the problem of learning rate adjustment for each sample.
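
For plain SGD the reasoning above can be checked numerically: scaling sample i's loss by s_i scales that sample's gradient by s_i, which has the same effect as using the learning rate lr * s_i for that sample. Below is a minimal sketch, assuming a hypothetical one-layer model and made-up scaling factors (none of these names come from the post). For adaptive optimizers such as Adam the correspondence is not exact, since they rescale gradients themselves.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: 3 samples, 4 features, one sigmoid output.
model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
inputs = torch.randn(3, 4)
targets = torch.tensor([[0.0], [1.0], [1.0]])
scaling = torch.tensor([[0.5], [1.0], [2.0]])  # made-up per-sample factors

criterion = nn.BCELoss(reduction='none')
weight = model[0].weight

# Gradient of each sample's unscaled loss, computed one sample at a time.
per_sample_grads = []
for i in range(3):
    loss_i = criterion(model(inputs[i:i + 1]), targets[i:i + 1]).sum()
    (grad_i,) = torch.autograd.grad(loss_i, weight)
    per_sample_grads.append(grad_i)

# Gradient of the scaled, summed loss over the whole batch.
scaled_loss = (criterion(model(inputs), targets) * scaling).sum()
(grad_scaled,) = torch.autograd.grad(scaled_loss, weight)

# The batch gradient should equal the scaling-weighted sum of per-sample gradients.
expected = sum(s * g for s, g in zip(scaling.flatten(), per_sample_grads))
grads_match = torch.allclose(grad_scaled, expected, atol=1e-5)
print(grads_match)
```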

Then I can call backward() and step():
loss_reduction_none.backward()
optimizer.step()

But I am not sure whether PyTorch will use the loss values that I have modified.
Actually, I don't know what the reduction parameter is designed for.

Is this the right answer?

Maybe this question is something like below:

Oli · October 15, 2019, 5:56am

I second @shirui-japina’s suggestions. Do a nn.LossFn(reduction='none'), weigh the different samples to your liking and then do your own reduction (e.g. mean).
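
The suggestion above could look roughly like this. It is only a sketch: the model, the batch, and the values in sample_weights are hypothetical and chosen for illustration, and nn.BCELoss stands in for the generic nn.LossFn.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model and batch; all names here are for illustration only.
model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.randn(3, 4)
targets = torch.tensor([[0.0], [1.0], [1.0]])
sample_weights = torch.tensor([[0.5], [1.0], [2.0]])  # made-up weights

criterion = nn.BCELoss(reduction='none')

optimizer.zero_grad()
per_sample_loss = criterion(model(inputs), targets)  # shape (3, 1): one loss per sample
loss = (per_sample_loss * sample_weights).mean()     # own reduction gives a scalar
loss.backward()
optimizer.step()
```

Dividing by sample_weights.sum() instead of taking the plain mean would give a weighted average, which keeps the overall loss scale independent of the chosen weights. Which of the two is preferable depends on the use case.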

shirui-japina · October 15, 2019, 6:36am

Thanks for your reply.

Do a nn.LossFn(reduction='none') , weigh the different samples to your liking

If the method above is correct (and I think it is),
why do I need to do a reduction (e.g. mean)?
What happens if I don't?

From the mathematical theory:

Batch Gradient Descent: when the number of samples is large, the training process will be slow.
Stochastic Gradient Descent: decreased accuracy, not globally optimal.
Mini-batch Gradient Descent: use a subset of samples for each parameter update.

Usually we use something like Mini-batch Gradient Descent to update the parameters, which means we use the average value of the loss function over the batch for the update.
Is the averaging step (the reduction, e.g. mean) equivalent to this process?
Or is the averaging step just a requirement of how PyTorch operates?

Oli · October 15, 2019, 8:48am

Yup. We have a loss for each sample in your batch, then take the mean of those to get a single scalar loss that we can backpropagate. (edit: it doesn't need to be the mean; other reductions such as sum work too)
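
To illustrate how the two built-in reductions relate: the gradient of the summed loss is the batch size times the gradient of the mean loss, so with plain SGD the choice between them acts like a rescaling of the learning rate. A small sketch with a hypothetical model and random data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model and random batch of 8 samples, for illustration only.
model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
inputs = torch.randn(8, 4)
targets = torch.randint(0, 2, (8, 1)).float()
weight = model[0].weight

loss_mean = nn.BCELoss(reduction='mean')(model(inputs), targets)
loss_sum = nn.BCELoss(reduction='sum')(model(inputs), targets)

(grad_mean,) = torch.autograd.grad(loss_mean, weight)
(grad_sum,) = torch.autograd.grad(loss_sum, weight)

# With one loss element per sample, 'sum' is batch_size times 'mean'.
batch_size = inputs.shape[0]
ratio_holds = torch.allclose(grad_sum, batch_size * grad_mean, atol=1e-5)
print(ratio_holds)
```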

shirui-japina · October 15, 2019, 9:22am

Thanks for your reply.
I think I understand this problem thoroughly now.

shirui-japina · October 15, 2019, 2:07pm

I have reconsidered it and found some problems.

The topic above describes how .backward() and optimizer.step() work in PyTorch.
A simple way to understand it is:

.backward(): computes the gradients for the parameters in the model.
optimizer.step(): performs a parameter update based on the current gradient (stored in the .grad attribute of a parameter) and the update rule.
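
The division of labour between the two calls can be seen on a toy example. This is a sketch with a hypothetical two-element parameter and a made-up scalar loss, using plain SGD so that the update rule is simply w <- w - lr * w.grad:

```python
import torch

# Hypothetical single parameter and a toy scalar loss, for illustration only.
w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()   # scalar loss; d(loss)/dw = 2 * w
optimizer.zero_grad()
loss.backward()         # fills w.grad with [2.0, -4.0]

w_old = w.detach().clone()
grad = w.grad.clone()
optimizer.step()        # plain SGD: w <- w - lr * w.grad

update_matches = torch.allclose(w.detach(), w_old - 0.1 * grad)
print(update_matches)
```

Note that the loss here is already a scalar when backward() is called.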

I mean, optimizer.step() takes care of the update rule and actually updates the parameters, so PyTorch users don't have to think about the update rule themselves. (We don't have to do a reduction (e.g. mean) or anything else by hand.)

What we should do is,

Before the function .backward(), adjust the loss function value of each sample.
Then get gradient for the parameters in model by .backward().
At the end, update the parameters based on the current gradient by optimizer.step().

The reduction parameter of LossFn() can be set to: 'none' | 'mean' | 'sum'.
But I think it is not there to average the loss values for the parameter update.
(Although I don't know what 'mean' | 'sum' are for.)

Oli · October 15, 2019, 2:34pm

If you set the reduction flag to 'none' and try to call backward() on the unreduced loss, you get this error message:

RuntimeError: grad can be implicitly created only for scalar outputs
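
The error can be reproduced with a few lines. This sketch uses a hypothetical model and batch; it catches the exception and then shows that reducing to a scalar first lets backward() run:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model and batch, for illustration only.
model = nn.Sequential(nn.Linear(4, 1), nn.Sigmoid())
inputs = torch.randn(3, 4)
targets = torch.tensor([[0.0], [1.0], [1.0]])

unreduced = nn.BCELoss(reduction='none')(model(inputs), targets)  # shape (3, 1)

error_message = None
try:
    unreduced.backward()   # non-scalar output and no explicit gradient given
except RuntimeError as exc:
    error_message = str(exc)
print(error_message)

# Reducing to a scalar first makes backward() work.
unreduced.sum().backward()
```

Alternatively, backward() accepts a gradient argument with the same shape as the tensor: unreduced.backward(gradient=weights) produces the same gradients as (unreduced * weights).sum().backward(), which is another way to apply per-sample weights.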

shirui-japina · October 15, 2019, 2:37pm