Loss function with regularization by norm of Hessian?

I am trying to understand autograd better and would like to implement the following example. Despite searching, I haven't found much on this elsewhere, and no working example at all.

In addition to minimizing the mean squared error, I would like to take into account the norm of the Hessian of the model. That is, I add a regularization term built from the squares of the second derivatives of the model, where the second derivatives are taken with respect to the input (not the parameters).
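
Written out, the combined loss I have in mind is roughly the following (possibly with a weighting factor in front of the second term, which I would tune):

$$ L \;=\; \frac{1}{N}\sum_{i=1}^{N}\bigl(f(x_i) - y_i\bigr)^2 \;+\; \sum_{i=1}^{N}\sum_{j,k}\Bigl(\frac{\partial^2 f}{\partial x_j\,\partial x_k}(x_i)\Bigr)^2 $$

where $f$ is the model and $x_j$, $x_k$ are the input coordinates.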

I am aware that doing so becomes resource-intensive for large networks; I consider this a learning experience for now.

Hopefully, someone can explain. A very simple example would look like this (the Hessian term I want is sketched in the comments):

import math
import random
import torch
import torch.nn

# a function that we might try to interpolate

def func( x1, x2 ):
  return math.sin( 2. * math.pi * x1 * x2 )


# generate input and outputs

N = 500

x1s = [ random.uniform(-1,1) for i in range(N) ] 
x2s = [ random.uniform(-1,1) for i in range(N) ] 
random.shuffle(x1s)
random.shuffle(x2s)
xs = [ [x1s[i],x2s[i]] for i in range(N) ]

ys = [ func(*x) for x in xs ]

data_x = torch.tensor( xs )                   # shape (N, 2)
data_y = torch.tensor( ys ).reshape( N, 1 )   # shape (N, 1)


# model 

model = torch.nn.Sequential(
          torch.nn.Linear(2, 10 ),
          torch.nn.ReLU(),         
          torch.nn.Linear(10, 11 ),
          torch.nn.ReLU(),         
          torch.nn.Linear(11, 12 ),
          torch.nn.ReLU(),         
          torch.nn.Linear(12, 1 )  
        )


# train the neural network 

optimizer = torch.optim.Adam( model.parameters(), lr = 0.01 )

num_epochs = 100
for epoch in range(num_epochs):
    
    # turn on training mode 
    model.train()
    model.zero_grad()
    
    # hand-written MSE (mean squared error)
    loss = torch.mean( ( model( data_x ) - data_y )**2 )
    
    # What I want is similar to 
    # loss = torch.mean( ( model( data_x ) - data_y )**2 ) + sum_of_squares_of_entries(Hessian_in_x)
    # Here, Hessian_in_x is the Hessian matrix of the model with second derivatives in the input

    loss.backward()
    optimizer.step()
    
    # Print the loss at the end of each epoch
    print( 'Epoch [{}/{}], Loss: {:.4f}'.format( epoch+1, num_epochs, loss.item() ) )

Hi Kiwi!

Pytorch’s hessian() functional is likely to work for you.

If I understand your use case correctly, you want to compute the hessian of
the ordinary loss of your model with respect to the input to your model.

You then want to add the sum of squares of that hessian as a “regularization”
term to your ordinary loss to get a combined loss that you use to train your
model.

Based on that, here is a toy script that illustrates what I think you want:

import torch
print (torch.__version__)

p = torch.tensor ([3.0, 2.0, 3.0, 2.0, 3.0], requires_grad = True)   # model parameter

def model (input):                     # model with (global variable) p as trainable parameter
    return  p * (input**3 / 6)

def loss_fn (input):                   # some dummy loss function
    return  input.sum()

def hess_func (input):                 # package as the function for which to compute the hessian
    return loss_fn (model (input))

input = torch.arange (1.0, 6.0)        # input with respect to which to compute hessian

# computation graph records dependence of hessian on p
hess = torch.autograd.functional.hessian (hess_func, input, create_graph = True)

print ('hess:')
print (hess)

hess_loss = (hess**2).sum()            # hess_loss depends on p
model_loss = loss_fn (model (input))   # new forward pass, model_loss depends on p
loss = hess_loss + model_loss

loss.backward()                        # compute gradient of combined loss with respect to p

print ('p.grad:', p.grad)

And here is its output:

1.13.1
hess:
tensor([[ 3.,  0.,  0.,  0.,  0.],
        [ 0.,  4.,  0.,  0.,  0.],
        [ 0.,  0.,  9.,  0.,  0.],
        [ 0.,  0.,  0.,  8.,  0.],
        [ 0.,  0.,  0.,  0., 15.]], grad_fn=<ViewBackward0>)
p.grad: tensor([  6.1667,  17.3333,  58.5000,  74.6667, 170.8333])

Best.

K. Frank


Thank you. My question was about taking the Hessian of the model rather than of loss_fn, in order to penalize changes of the model's derivative with respect to the input. Still, this gives me a better idea of how to proceed.
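
For reference, here is an (untested) sketch of how I would adapt your example to my case, i.e. computing the Hessian of the model output itself with respect to each input sample and penalizing its squared entries. Two things in it are my own assumptions rather than anything from your post: I swapped the ReLU activations for Tanh (a ReLU network is piecewise linear, so its input Hessian would be zero almost everywhere), and I added a hypothetical weighting factor lam for the penalty. The per-sample Python loop is slow and only meant to illustrate the idea.

import torch

# same architecture as before, but with Tanh so the input Hessian is not zero almost everywhere
model = torch.nn.Sequential(
          torch.nn.Linear(2, 10),
          torch.nn.Tanh(),
          torch.nn.Linear(10, 11),
          torch.nn.Tanh(),
          torch.nn.Linear(11, 12),
          torch.nn.Tanh(),
          torch.nn.Linear(12, 1)
        )

lam = 0.1                              # hypothetical weighting factor for the penalty

def scalar_model( x ):                 # single sample x of shape (2,) -> scalar model output
    return model( x ).squeeze()

def hessian_penalty( batch_x ):
    # mean over the batch of the sum of squared entries of the 2x2 input Hessian;
    # create_graph = True keeps the dependence on the model parameters for backward()
    penalty = 0.0
    for x in batch_x:
        hess = torch.autograd.functional.hessian( scalar_model, x, create_graph = True )
        penalty = penalty + (hess**2).sum()
    return penalty / batch_x.shape[0]

# inside the training loop, instead of the plain MSE:
# loss = torch.mean( ( model( data_x ) - data_y )**2 ) + lam * hessian_penalty( data_x )

For larger batches I would presumably try to vectorize this with torch.func.vmap / torch.func.hessian instead of the Python loop, but the loop version is the most direct translation of your example.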