Possible reasons for NN regression giving inferior results to ridge regression

I am trying to use a simple 3-layer neural net to predict a scalar output from an input of dimension 430. For my network, I use two hidden layers of dimensions 600 and 80 with LeakyReLU non-linearities. I also tried the same task with ridge regression, and the ridge solution is better than my PyTorch implementation. I have been playing around with the number of layers, their dimensions, the non-linearity, and the learning rate, but none of it improved the results. I was wondering if this is to be expected and, if not, what else you would suggest analyzing.
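Roughly, the model looks like this (a simplified sketch of the architecture described above; the actual training code is not shown):

import torch.nn as nn

# Sketch of the described architecture: 430 -> 600 -> 80 -> 1
model = nn.Sequential(
    nn.Linear(430, 600),
    nn.LeakyReLU(),
    nn.Linear(600, 80),
    nn.LeakyReLU(),
    nn.Linear(80, 1),  # scalar output
)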

If you have a working model, e.g. sklearn.linear_model.Ridge, make sure to dig a bit into that model, and then you could try to reimplement it in PyTorch.
A lot of sklearn models use some regularization, which has proven to work well, while these techniques are often forgotten in custom PyTorch implementations.


Thanks @ptrblck! I fixed the regularization, but the error did not get much better. I was in fact using RidgeCV from sklearn, but the idea is the same. I looked at the source code (thanks, that is always good research-strategy advice) and I think the only remaining difference is that sklearn uses an SVD to find the coefficients: it sets the derivative of the loss with respect to the weights to zero and expresses the closed-form solution in terms of the SVD decomposition matrices.

As far as I understand, backprop does that for me in a network anyway, and I don't need to worry about how the weights are updated beyond choosing my loss function. I could implement the SVD solution in PyTorch, but then it wouldn't be a neural network anymore.
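For reference, the closed-form ridge solution could itself be written in PyTorch roughly like this (a sketch with made-up data shapes and an arbitrary alpha; intercept handling and the cross-validation part of RidgeCV are omitted, so this is not the exact sklearn code):

import torch

X = torch.randn(1000, 430)  # hypothetical design matrix (n_samples, n_features)
y = torch.randn(1000)       # hypothetical targets
alpha = 1.0                 # ridge regularization strength

# Ridge solution w = (X^T X + alpha*I)^-1 X^T y, computed via the SVD X = U diag(s) V^T
U, s, Vh = torch.linalg.svd(X, full_matrices=False)
d = s / (s ** 2 + alpha)    # shrunken inverse singular values
w = Vh.T @ (d * (U.T @ y))  # ridge coefficients, shape (430,)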

For the interested future reader, as can be seen in Bishop's PRML book:

The particular case of a quadratic regularizer is called ridge
regression (Hoerl and Kennard, 1970). In the context of neural
networks, this approach is known as weight decay.
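Concretely, the "quadratic regularizer" means minimizing a sum-of-squares error plus an L2 penalty on the weights, roughly (in PRML's notation):

$$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^\mathsf{T}\boldsymbol{\phi}(x_n)\right)^2 + \frac{\lambda}{2}\,\mathbf{w}^\mathsf{T}\mathbf{w}$$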

So we can either set the weight_decay parameter of the optimizer to a non-zero value, or we can add an L2 penalty to the loss function ourselves, as follows:

import torch

criterion = torch.nn.MSELoss()
lmbd = 1e-8  # strength of the custom L2 regularization

loss = criterion(y_pred, y_train)

# accumulate the squared L2 norm of all parameters
reg_loss = None
for param in model.parameters():
    if reg_loss is None:
        reg_loss = torch.sum(param ** 2)
    else:
        reg_loss = reg_loss + torch.sum(param ** 2)

loss = loss + lmbd * reg_loss
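
Alternatively, the weight_decay route is just a keyword argument on the optimizer; a minimal sketch (the learning rate here is arbitrary, and the exact scaling of the penalty differs slightly from the manual version above):

import torch

# built-in L2 penalty applied by the optimizer at each step
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-8)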