How to use the 1st derivative of a network for training a data-driven physics model approximating a function?

Hi guys,

First, I am new to this forum and to machine learning as well, please keep that in mind. :slight_smile:

I am trying to reproduce the neural network proposed in this paper using PyTorch: A mechanics-informed artificial neural network approach in data driven constitutive modeling. The goal of the neural network is to approximate a nonlinear function, which maps the input to the output, but using the derivative of the network as the actual output. That means, that the network learns the integral of the function.

My approach to coding this with PyTorch is at the end of this question (It would be great if you could review that as well, as the mistake might be somewhere else and I don’t see it.).

The whole idea of the paper is as follows:

Input: x_0; Output: y

x_1 = activation(w_input * x_0 +b)
for l = 2,…,N_l do
	x_l = activation(softplus(Q;alpha) * x_(l-1) + b_l + w_skip_l * x_o
end for
y = softplus(Q, alpha) * x_(N_l-1) + f(x_0)

learnable parameters: [alpha_l, Q_l, W_skip_l, b_l, A (see further down), beta] 

with the activation function being essentially the softplus function squared :

class LearnedSoftPlusSquared(torch.nn.Module):
    def __init__(self, init_beta=1.0, threshold=20):
        super().__init__()
        self.log_beta = torch.nn.Parameter(torch.tensor(float(init_beta)).log())
        self.threshold = 20
    def forward(self, x):
        beta = self.log_beta.exp()
        beta_squared = beta**2
        beta_squared_x = beta_squared * x
        return torch.where(beta_squared_x < 20, 0.5 * ((torch.log1p(beta_squared_x.exp()) / beta_squared)**2), x)

And the loss function (better representation in the linked paper):

Loss = sum_over_samples_and_dimensions(((derivative_wrt_to_input(evaluated at input) - derivative_wrt_to_input(evaluated at 0) - target value)/elementwise_standard_deviation)^2)

In code:

def final_loss_func(gradients_output_wrt_input, gradients_corrector, target):
    sigma_k = torch.std(target, dim = 0)
    loss = torch.mean(torch.sum(((gradients_output_wrt_input - gradients_corrector - target)/sigma_k)**2, dim = 1))
    return loss

I can’t really figure out my mistake, although I believe the problem lies in my implementation of the loss function, which uses the derivative of the network and a correction term (which is also based on the derivative of the network.

Right now, the loss starts out very high and decreases to a certain point, from which it rises again.
It is also not possible to overfit, even with very small examples. In my opinion that is caused by the correction term which is applied in the loss function.

My questions regarding this network are:

How does one implement the double differentiation properly?

I tried to get the gradient of the output with respect to the input by using autograd, same procedure for the correction term, enabling the option create_graph. Those quantities were then used in the loss function, which is then used with the backward() method. My code for the training routine looks like this:

X_corr = torch.zeros(1,input_dim, requires_grad = True).type(torch.FloatTensor)
Y_corr = torch.zeros_like(X_corr)

def final_loss_func(gradients_output_wrt_input, gradients_corrector, target):
    sigma_k = torch.std(target, dim = 0)
    loss = torch.mean(torch.sum(((gradients_output_wrt_input - gradients_corrector - target)/sigma_k)**2, dim = 1))
    return loss

model = MyModule()
learning_rate = 0.00001
optimizer = torch.optim.Adagrad(model.parameters(), lr=learning_rate)
epochs = 20000+1


for epoch in range(epochs):
    
    optimizer.zero_grad()
    
    pred = model(X)
    pred_corr = model(X_corr)
    
    X_gradients = torch.autograd.grad(pred, X, retain_graph=True, grad_outputs=torch.ones_like(pred), create_graph = True)[0]
    X_corr_gradients = torch.autograd.grad(pred_corr, X_corr, retain_graph=True, grad_outputs=torch.ones_like(pred_corr), create_graph = True)[0]

    X_corr_gradients_no_grad = X_corr_gradients.detach()
    loss = final_loss_func(X_gradients, X_corr_gradients_no_grad, Y)
    optimizer.step()
        

Also: how do I add a convex function like f(x) = x^T *A^T *A *x, which is proposed in the paper, to the output, while keeping the Matrix A as a learnable parameter?

The problem here is the dimension of the input is batch_size * sample_dimension leading to f(x) as a batch_size * batch_size matrix, although it must be a batch_size * 1 vector. Applying f(x) to each sample separately delivers the right dimension, but also makes it impossible to learn A as a parameter, because it makes the differentiation impossible, right? My code for that looks like this:

class ConvexFunction(torch.nn.Module):
    def __init__(self, input_size):
        super().__init__()
        A = torch.Tensor(input_size, input_size)
        self.A = torch.nn.Parameter(A)
        nn.init.orthogonal_(self.A)
    def forward(self, x):
        A_T = torch.transpose(self.A, dim0 = 0, dim1 = 1)
        A_T_A = torch.matmul(A_T, self.A)
        result = torch.zeros(x.size(dim = 0),1)
        for i in range(x.size(dim = 0)):
            result[i] = x[i] @ A_T_A @ x[i]
        return result

The whole code for the network:

import torch 
import torch.nn as nn
import pandas as pd

torch.manual_seed(0)

df = pd.read_csv("Strain_Stress_Pairs_XL_3columns.csv", sep = ',', decimal='.')
x = df.iloc[:,0:3].values
y = df.iloc[:,3:7].values

x = x *9.091
X = torch.from_numpy(x).type(torch.FloatTensor)
Y = torch.from_numpy(y).type(torch.FloatTensor)
X.requires_grad_(True)


means = Y.mean(dim=1, keepdim=True)
stds = Y.std(dim=1, keepdim=True)
for i in range(stds.size(dim = 0)):
    if stds[i] < 1e-20:
        stds[i] = 1e-20
Y = (Y - means) / stds


X_corr = torch.zeros(1,3, requires_grad = True).type(torch.FloatTensor)
Y_corr = torch.zeros_like(X_corr)

class ConvexFunction(torch.nn.Module):
    def __init__(self, input_size):
        super().__init__()
        A = torch.Tensor(input_size, input_size)
        self.A = torch.nn.Parameter(A)
        nn.init.orthogonal_(self.A)
    def forward(self, x):
        A_T = torch.transpose(self.A, dim0 = 0, dim1 = 1)
        A_T_A = torch.matmul(A_T, self.A)
        result = torch.zeros(x.size(dim = 0),1)
        for i in range(x.size(dim = 0)):
            result[i] = x[i] @ A_T_A @ x[i]
        return result

class LearnedSoftPlusSquared(torch.nn.Module):
    def __init__(self, init_beta=1.0, threshold=20):
        super().__init__()
        self.log_beta = torch.nn.Parameter(torch.tensor(float(init_beta)).log())
        self.threshold = 20
    def forward(self, x):
        beta = self.log_beta.exp()
        beta_squared = beta**2
        beta_squared_x = beta_squared * x
        return torch.where(beta_squared_x < 20, 0.5 * ((torch.log1p(beta_squared_x.exp()) / beta_squared)**2), x)

class SoftPlusLinear(nn.Module):
    def __init__(self, input_size, output_size, init_alpha=1.0, threshold=20):
        super().__init__()
        w = torch.Tensor(output_size, input_size)
        self.w = nn.Parameter(w)
        nn.init.orthogonal_(self.w)
        b = torch.Tensor(output_size).fill_(0)
        self.b = nn.Parameter(b)
        self.log_alpha = torch.nn.Parameter(torch.tensor(float(init_alpha)).log())
        self.threshold = 20

    def forward(self, x):
        alpha = self.log_alpha.exp()
        alpha = alpha**2
        alpha_weight = alpha * self.w
        w_new = torch.log1p(alpha_weight.exp()) / alpha
        return nn.functional.linear(x, w_new, bias=self.b)

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        
        input_size = 3
        out_l1 = 9
        out_l2 = 9
        out_l3 = 6
        out_l4 = 3
        output_size = 1
        
        w_skip_1 = torch.Tensor(input_size, out_l2)
        self.w_skip_1 = nn.Parameter(w_skip_1)
        nn.init.orthogonal_(self.w_skip_1)
        
        w_skip_2 = torch.Tensor(input_size, out_l3)
        self.w_skip_2 = nn.Parameter(w_skip_2)
        nn.init.orthogonal_(self.w_skip_2)
        
        w_skip_3 = torch.Tensor(input_size, out_l4)
        self.w_skip_3 = nn.Parameter(w_skip_3)
        nn.init.orthogonal_(self.w_skip_3)

        matrix_conv = torch.Tensor(input_size,1)
        self.matrix_conv = nn.Parameter(matrix_conv)
        nn.init.orthogonal_(self.matrix_conv)
        
        self.convex_layer = ConvexFunction(input_size)    
        self.l1 = nn.Linear(input_size,out_l1)
        self.a1 = LearnedSoftPlusSquared()
        self.l2 = SoftPlusLinear(out_l1,out_l2)
        self.a2 = LearnedSoftPlusSquared()
        self.l3 = SoftPlusLinear(out_l2,out_l3)
        self.a3 = LearnedSoftPlusSquared()
        self.l4 = SoftPlusLinear(out_l3,output_size)

    def forward(self, x):

        x_in = x
        x = self.l1(x)
        x = self.a1(x)
        x = self.l2(x)
        x = torch.add(x, torch.mm(x_in, self.w_skip_1))
        x = self.a2(x)
        x = self.l3(x) 
        x = torch.add(x, torch.mm(x_in, self.w_skip_2))
        x = self.a3(x)
        x = self.l4(x)#+ self.convex_layer(x_in) convex function not working!
        return x
    
def final_loss_func(gradients_output_wrt_input, gradients_corrector, target):
    sigma_k = torch.std(target, dim = 0)
    loss = torch.mean(torch.sum(((gradients_output_wrt_input - gradients_corrector - target)/sigma_k)**2, dim = 1))
    return loss

model = MyModule()
learning_rate = 0.00001
optimizer = torch.optim.Adagrad(model.parameters(), lr=learning_rate)
epochs = 20000+1


for epoch in range(epochs):
    
    optimizer.zero_grad()
    
    pred = model(X)
    pred_corr = model(X_corr)
    
    X_gradients = torch.autograd.grad(pred, X, retain_graph=True, grad_outputs=torch.ones_like(pred), create_graph = True)[0]
    X_corr_gradients = torch.autograd.grad(pred_corr, X_corr, retain_graph=True, grad_outputs=torch.ones_like(pred_corr), create_graph = True)[0]

    X_corr_gradients_no_grad = X_corr_gradients.detach()
    loss = final_loss_func(X_gradients, X_corr_gradients_no_grad, Y)
    optimizer.step()
    loss.backward()    

    print(loss)

Thank you for your time.