Implementing a custom loss function for ridge regression

import numpy as np
import torch

def ridge_loss(Y, pred, w, lamb):
    # Mean squared prediction error plus an L2 penalty on the weights
    pred_loss = torch.norm(Y - pred, p='fro') ** 2
    reg = torch.norm(w, p='fro') ** 2
    return (1 / Y.size()[0]) * pred_loss + lamb * reg

def fit(lamb, X_pt, Y_pt, w, epochs=5000, learning_rate=0.1):
    w_pt = torch.tensor(w, requires_grad=True)  # trainable weight matrix
    opt = torch.optim.Adam([w_pt], lr=learning_rate, betas=(0.9, 0.99), eps=1e-08, weight_decay=0, amsgrad=False)
    for epoch in range(epochs):
        pred = torch.matmul(X_pt, w_pt)
        loss = ridge_loss(Y_pt, pred, w_pt, lamb)
        loss.backward()
        opt.step()
        opt.zero_grad()
    return w_pt  # return after the loop finishes

X_pt = torch.from_numpy(X)        # X_train
Y_pt = torch.from_numpy(Y)        # y_train
X_ptt = torch.from_numpy(X_test)  # X_test
Y_ptt = torch.from_numpy(Y_test)  # y_test
w = np.random.rand(X.shape[1], Y.shape[1])
weight = fit(0.001, X_pt, Y_pt, w)

I am new to PyTorch. I want to learn how to use custom loss functions, and to get started I implemented ridge regression. I find that my error values are much higher than those of sklearn's ridge regression implementation. Can you please help me find the mistake in my loss function?

Sklearn most likely is not using first-order gradient descent to solve this. I can't spot an error in your code, so maybe you just need to add lr decay (a scheduler) - in general, you should check whether your loss decreases at a reasonable pace. Another possible issue is non-normalized data (i.e. the epoch-0 prediction is too far off).
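
For example, a scheduler could be attached to the existing optimizer roughly like this (a minimal sketch reusing the fit loop from the original post; the StepLR step_size and gamma values are arbitrary placeholders, not tuned):

# Sketch: same training loop as in fit(), with StepLR halving the lr every 1000 epochs.
# step_size and gamma here are placeholder values.
opt = torch.optim.Adam([w_pt], lr=learning_rate)
scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=1000, gamma=0.5)
for epoch in range(epochs):
    pred = torch.matmul(X_pt, w_pt)
    loss = ridge_loss(Y_pt, pred, w_pt, lamb)
    loss.backward()
    opt.step()
    opt.zero_grad()
    scheduler.step()  # decay the learning rate on a fixed schedule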

@googlebot Thanks for replying.

  1. I will implement a scheduler.
  2. I tried printing the loss while gradient descent is running; it falls at first and then stays constant at a not-so-low value without any further change.
  3. My X is zero mean, unit variance (standard normal), so I think scaling shouldn't be an issue - please let me know if I have understood this wrong. What do you mean by "epoch 0 prediction is too far off"?

I think your 1/Y.size() term is incorrect - you're overemphasizing the L2 penalty.
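
Also, to make the comparison fair: sklearn's Ridge minimizes ||y - Xw||^2_2 + alpha * ||w||^2_2 with no 1/n factor, so its objective only matches the loss above when alpha = lamb * n. A rough sketch of an apples-to-apples baseline (assuming no intercept, to match the model in the post):

# Sketch: sklearn baseline whose objective equals n * ridge_loss, i.e. alpha = lamb * n.
from sklearn.linear_model import Ridge

n = X.shape[0]
clf = Ridge(alpha=0.001 * n, fit_intercept=False)  # no intercept, like the PyTorch model
clf.fit(X, Y)
print(clf.coef_)  # compare with weight.detach().numpy().T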

E.g. if true_y = x * 100 + b, but your w initialization range is something like -3…3 (and you don't model the bias at all). Accelerated optimizers help here, but that may not be enough for harder problems and mini-batches.
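
To illustrate that point (hypothetical numbers, not from the thread): with targets on a scale of ~100 and weights initialized near zero, the epoch-0 residual is huge, so plain gradient descent needs many steps or a carefully tuned lr to close the gap:

# Hypothetical illustration of an epoch-0 prediction that is "too far off".
X_demo = np.random.randn(100, 1)
Y_demo = X_demo * 100 + 5      # true_y = x * 100 + b
w_init = np.random.rand(1, 1)  # initialized in [0, 1), far from 100
print(np.mean((Y_demo - X_demo @ w_init) ** 2))  # very large initial MSE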

@googlebot I have figured out a bug in my error metric function, which is why it was showing a higher error rate. The method seems to converge without any need to schedule the learning rate. Thanks a lot for your help.
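
(The error metric function itself isn't shown in the thread, so the following is only a hypothetical sketch of a test-set check using the X_ptt / Y_ptt tensors defined above.)

# Hypothetical test-set MSE check; the original (buggy) metric function is not in the thread.
with torch.no_grad():
    test_pred = torch.matmul(X_ptt, weight)
    test_mse = torch.mean((Y_ptt - test_pred) ** 2)
print(test_mse.item())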