Hi, I have two related questions to ask…
Recently I've been working on a project whose loss function contains loss A (classification) and loss B (regression). There are only a few classes but lots of boxes in my dataset, so my model has a somewhat hard time decreasing the regression loss. (Note that the regression loss is actually decreasing, but it's 3 or 4 times larger than the classification loss.)
Here's the first question:
In this kind of situation, what am I supposed to do to make the regression loss go down faster?
Since the gradient is affected by the upstream gradient, if I add a multiplier to the regression loss formula, would this have a direct impact on my gradient and make the regression loss decrease faster?
For example, if my loss calculation looks like this:
reg_loss = torch.where(torch.le(reg_dif, 1.0),
                       0.5 * torch.pow(reg_dif, 2),
                       reg_dif - 0.5)
Should I simply multiply it by a constant, and would this make the gradient different from the original one (without the multiplier)?
Yes, loss scaling is a valid approach. Basically, the cross-entropy formula matches the negative log-likelihood (NLL) of a categorical distribution, and MSE matches that of a Gaussian with scale 1 (up to a constant factor and offset). So rescaling the loss corresponds to adding a Gaussian scale as either a (trainable scalar) parameter or a hyperparameter. Similar logic applies to other loss functions, if you can imagine them describing log-densities of some probability distributions.
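To make the Gaussian correspondence concrete, here is a minimal sketch. The tensors `pred` and `target` and the scalar `log_sigma` are hypothetical names chosen for illustration, not anything from the original model. It writes the Gaussian NLL out by hand, checks it against `torch.distributions.Normal`, and confirms that with scale 1 it reduces to half the MSE plus a constant.

```python
import math
import torch

torch.manual_seed(0)
pred = torch.randn(16)    # hypothetical regression outputs
target = torch.randn(16)  # hypothetical regression targets

# Hypothetical trainable scalar; sigma = exp(log_sigma) = 1 at initialisation.
log_sigma = torch.zeros((), requires_grad=True)
sigma = log_sigma.exp()

# Gaussian negative log-likelihood written out by hand:
# (x - mu)^2 / (2 sigma^2) + log(sigma) + 0.5 * log(2 pi)
sq_err = (pred - target) ** 2
nll = (sq_err / (2 * sigma ** 2) + log_sigma + 0.5 * math.log(2 * math.pi)).mean()

# Cross-check against torch.distributions.
reference = -torch.distributions.Normal(pred, sigma).log_prob(target).mean()
print(torch.allclose(nll, reference))

# With sigma = 1 the NLL is 0.5 * MSE plus a constant.
mse = sq_err.mean()
print(torch.allclose(nll, 0.5 * mse + 0.5 * math.log(2 * math.pi)))
```

Read this way, the factor 1 / (2 * sigma ** 2) plays the role of the loss multiplier: fixing sigma by hand makes it a hyperparameter, while leaving `log_sigma` trainable makes it a learned scale.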
OTOH, such networks are multi-task ones, with parameter sharing. So it is possible that tasks will “compete” for shared parameters. This is not always a problem, but when it is, loss scaling won’t be a panacea…
Thanks for your reply! I'm doing this with RetinaNet, which separates the regression and classification problems (not sharing features)!
By the way, what should I do to make the regression part better? My idea is to use 0.7 for regression and 0.3 for classification, since I want the regression loss to decrease more. Correct me if I'm wrong. Thanks in advance!
That's OK. For a non-trainable proportion it is enough to scale one of the losses, as the total loss will additionally be implicitly rescaled by the learning rate. I.e. total_loss = classification_loss + regr_loss * k, with a positive hyperparameter k (e.g. 0.7/0.3).
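As a rough check of what the multiplier does to the gradients, the sketch below uses random stand-in tensors (`pred_boxes`, `target_boxes`, `logits` and `labels` are hypothetical, not RetinaNet outputs) together with the piecewise loss from the question. It compares gradients with k = 1 and k = 0.7/0.3.

```python
import torch

def smooth_l1(reg_dif):
    # Same piecewise form as in the question; reg_dif is |prediction - target|.
    return torch.where(reg_dif <= 1.0, 0.5 * reg_dif ** 2, reg_dif - 0.5)

def head_grads(k):
    # Hypothetical stand-ins for the outputs of the two heads.
    torch.manual_seed(0)  # same random tensors on every call
    pred_boxes = torch.randn(8, 4, requires_grad=True)
    target_boxes = torch.randn(8, 4)
    logits = torch.randn(8, 3, requires_grad=True)
    labels = torch.randint(0, 3, (8,))

    classification_loss = torch.nn.functional.cross_entropy(logits, labels)
    regr_loss = smooth_l1((pred_boxes - target_boxes).abs()).mean()
    total_loss = classification_loss + regr_loss * k
    total_loss.backward()
    return pred_boxes.grad.clone(), logits.grad.clone()

k = 0.7 / 0.3
grad_box_1, grad_cls_1 = head_grads(1.0)
grad_box_k, grad_cls_k = head_grads(k)

# The regression gradient is scaled by exactly k ...
print(torch.allclose(grad_box_k, k * grad_box_1))
# ... while the classification gradient is unchanged.
print(torch.allclose(grad_cls_k, grad_cls_1))
```

So the constant does change the gradient: everything upstream of regr_loss receives a gradient k times larger, which for that branch acts much like a larger learning rate.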
You may also try trainable loss weights; in that case they should indeed sum to 1.
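One possible way to keep trainable weights positive and summing to 1 is to pass a trainable vector through a softmax. This is only a sketch under that assumption: `weight_logits` is a hypothetical parameter and the two loss values are placeholders.

```python
import torch

# Hypothetical trainable parameter; softmax keeps the weights positive with sum 1.
weight_logits = torch.zeros(2, requires_grad=True)
weights = torch.softmax(weight_logits, dim=0)  # starts at [0.5, 0.5]

# Placeholder loss values standing in for the real ones.
classification_loss = torch.tensor(0.4)
regr_loss = torch.tensor(1.5)

total_loss = weights[0] * classification_loss + weights[1] * regr_loss
total_loss.backward()

print(weights.sum())       # the weights sum to 1 by construction
print(weight_logits.grad)  # gradient with respect to the weight logits
```

In a real model `weight_logits` would be handed to the optimizer along with the network parameters. Note that in this toy setup the gradient is positive for the regression weight's logit, so plain gradient descent would shift weight toward the smaller (classification) loss; it is worth monitoring the learned weights rather than assuming they settle somewhere useful.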
PS: I'm not familiar enough with RetinaNet to give more specific tips.