Loss of result precision in a function converted from NumPy to Torch

Hi All,

I am trying to move a model from TensorFlow 1 to PyTorch.
The model is quite involved and I have been unable to get one portion of it to work. In particular, I have found that a function in PyTorch appears to return a result around 10% off what the equivalent function returns in TensorFlow or NumPy.

I believe that this 10% difference is an error that impacts my loss function and prevents the model from learning.

I have isolated the function here and show both the Torch and NumPy ‘equivalents’. Linked below are the Torch model and the comparison data needed; putting both files in a directory with the script below allows the script to run. The variables on the TensorFlow/NumPy side are float32; I have set the Torch variables to float64, but that has not improved the issue. I compare the weights and biases from that model with my saved weights to show that they are the same. I don’t use the model itself in this example, but I show a function that it contains and the difference between the result of that Torch version and an equivalent NumPy version. The NumPy function is based on the original function in the TFv1 model (but with NumPy rather than TF libraries); it agrees with the TF function’s results to within 10e-05.
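As an aside, the script below compares weights by summing them; an elementwise check along these lines would be stricter (the helper name here is illustrative, not from my code):

import numpy as np

def assert_converted_weights_match(torch_weights, np_weights, atol=1e-6):
    # raises AssertionError with a mismatch summary if any single element
    # differs by more than atol, rather than letting errors cancel in a sum
    np.testing.assert_allclose(torch_weights.cpu().detach().numpy(),
                               np_weights, atol=atol)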

Could you advise:

  1. Is there anything ‘off’ with the way I’m using Torch? In particular, I have a sequence set up with a custom layer that includes a sigmoid activation, followed by a final layer that does not. That sequence is used for both the encoding and decoding networks of the autoencoder part (with the layer dimensions reversed for the decode). I then manually access the weights from the ‘decode’ sequence to carry out the computation in the attached function. Is this a fair approach? (A simplified sketch of this structure follows the list below.)
  2. I had a go at extracting gradients to compare with those produced by the TF model. I initialized both models with tensor values taken from the same point midway through the original TensorFlow model’s training cycle. This was unsuccessful: I could only get gradient values for one tensor, and they bore no resemblance to the TF equivalents on the same data. I used the function

    torch.autograd.grad(training_loss, sindy_autoencoder.Ξ_sindy_coefficients)

but it threw errors on all but that one tensor. Is there a better way to do this (a sketch of my attempt also follows the list)? Am I wrong to expect similar results between the TF and PT models (both using the ‘equivalent’ Adam optimizer with the same learning rate parameter)?
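For reference on (1), here is a simplified sketch of the structure I describe. The class names, dimensions, and initialization are illustrative only, not my actual code:

import torch
import torch.nn as nn

class SigmoidLayer(nn.Module):
    # stand-in for my custom layer: an affine transform whose weights/bias
    # I can access directly, followed by a sigmoid activation
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.weights = nn.Parameter(torch.empty(dim_in, dim_out, dtype=torch.float64))
        self.bias = nn.Parameter(torch.zeros(dim_out, dtype=torch.float64))
        nn.init.xavier_uniform_(self.weights)

    def forward(self, x):
        return torch.sigmoid(torch.mm(x, self.weights) + self.bias)

class FinalLayer(SigmoidLayer):
    # last layer in each sequence: the same affine transform, no activation
    def forward(self, x):
        return torch.mm(x, self.weights) + self.bias

# the decoder dimensions are the encoder's reversed; the values are illustrative
dims = [3, 32, 64, 96, 128]
layers = [SigmoidLayer(a, b) for a, b in zip(dims[:-2], dims[1:-1])]
layers.append(FinalLayer(dims[-2], dims[-1]))
decoder = nn.Sequential(*layers)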
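And here is roughly the gradient extraction I attempted for (2), generalized over all parameters. training_loss is my scalar loss and sindy_autoencoder my model, as above; allow_unused=True (a documented torch.autograd.grad option) makes grad return None for parameters outside the loss graph instead of raising, which I suspect is behind the errors I saw:

# sketch: compare per-parameter gradient sums between the two models;
# training_loss and sindy_autoencoder are assumed to exist as in my setup
params = list(sindy_autoencoder.parameters())
grads = torch.autograd.grad(training_loss, params, allow_unused=True)
for (name, _), grad in zip(sindy_autoencoder.named_parameters(), grads):
    if grad is None:
        print('{}: no gradient (parameter unused in the loss graph?)'.format(name))
    else:
        print('{}: gradient sum {:.6f}'.format(name, grad.sum().item()))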

At this point I am open to any suggestions for further investigation. I realize this is quite an involved post. The bottom line: I would like to know how to make the computations in calculate_first_and_second_derivative_with_torch equivalent to calculate_first_and_second_derivative_with_np to within 4 or 5 decimal places, and I would like to understand why they are not equivalent currently.
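Concretely, ‘equivalent’ for me means checks like these passing, using the variables defined in the script below (the tolerance reflects the 4 or 5 decimal places):

# elementwise tolerance checks expressing the equivalence I am after
np.testing.assert_allclose(dx_decode_torch_test.cpu().detach().numpy(),
                           dx_decode_np_test, atol=1e-4)
np.testing.assert_allclose(ddx_decode_torch_test.cpu().detach().numpy(),
                           ddx_decode_np_test, atol=1e-4)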

Regards,

Simon

The data needed to reproduce this is saved here:
https://drive.google.com/file/d/1lClIUWuHDGtibSXN2h5X-cyMaalU-cbX/view?usp=sharing

The full torch model is saved in a pickle for use with torch.load:
https://drive.google.com/file/d/1bFJYC5bHme7YmIbqTOjaxXvd-yrKczxH/view?usp=sharing

Put these files in the same directory as the code below to see the issue (inconsistent results between Torch and NumPy/TensorFlow v1).

import pickle
from typing import Dict, Any
import numpy as np
import torch

def run_experiment():


    ###############################################
    # load saved model
    ###############################################

    with open('recovered_autoencoder_network.pkl', 'rb') as f:
        recovered_autoencoder_network = pickle.load(f)

    # parameters needed for this issue
    params: Dict[str, Any] = {'weight_precision': torch.float64,
                              'sindy_precision': torch.float64,
                              'target_device': 'cuda'}

    sindy_autoencoder = torch.load('saved_model.pkl')
    sindy_autoencoder.to(params['target_device'])

    # this is a version of the 'problem' function in torch.
    def calculate_first_and_second_derivative_with_torch(input_and_derivatives, stack):
        x, dx, ddx = input_and_derivatives

        layer_count = len(stack)

        for i in range(layer_count - 1):
            # forward pass through a hidden layer (affine transform + sigmoid)
            x = torch.mm(x, stack[i].weights) + stack[i].bias
            x = torch.sigmoid(x)
            dx_prev = torch.mm(dx, stack[i].weights)
            # with s = sigmoid(z): s' = s(1 - s) and s'' = s'(1 - 2s)
            sigmoid_first_derivative = torch.mul(x, 1 - x)
            sigmoid_second_derivative = torch.mul(sigmoid_first_derivative, 1 - 2 * x)
            # chain rule for the first and second derivatives through the layer
            dx = torch.mul(sigmoid_first_derivative, dx_prev)
            ddx = torch.mul(sigmoid_second_derivative, torch.square(dx_prev)) \
                  + torch.mul(sigmoid_first_derivative, torch.mm(ddx, stack[i].weights))
        # the final layer is linear, so the derivatives only pass through its weights
        dx = torch.mm(dx, stack[layer_count - 1].weights)
        ddx = torch.mm(ddx, stack[layer_count - 1].weights)

        return dx, ddx

    # this is the equivalent 'problem' function in numpy.
    def calculate_first_and_second_derivative_with_np(input, dx, ddx, weights, biases):
        dz = dx
        ddz = ddx

        def sigmoid(x):
            return 1 / (1 + np.exp(-x))

        for i in range(len(weights) - 1):
            input = np.matmul(input, weights[i]) + biases[i]
            input = sigmoid(input)
            dz_prev = np.matmul(dz, weights[i])
            sigmoid_derivative = np.multiply(input, 1 - input)
            sigmoid_derivative2 = np.multiply(sigmoid_derivative, 1 - 2 * input)
            dz = np.multiply(sigmoid_derivative, dz_prev)
            ddz = np.multiply(sigmoid_derivative2, np.square(dz_prev)) \
                  + np.multiply(sigmoid_derivative, np.matmul(ddz, weights[i]))
        dz = np.matmul(dz, weights[-1])
        ddz = np.matmul(ddz, weights[-1])

        return dz, ddz


    # Here I access the tensors recovered from the saved Tensorflow model and convert them to torch.
    converted_stack = [torch.tensor(recovered_autoencoder_network['v2_in_z'],
                                    device=torch.device(params['target_device']),
                                    dtype=params['sindy_precision']),
                       torch.tensor(recovered_autoencoder_network['v2_in_dz'],
                                    device=torch.device(params['target_device']),
                                    dtype=params['sindy_precision']),
                       torch.tensor(recovered_autoencoder_network['v2_in_sindy_predict'],
                                    device=torch.device(params['target_device']),
                                    dtype=params['sindy_precision'])]

    # Here I use the tensors captured from the tensorflow model (converted to torch)
    # with the torch version of the function.
    dx_decode_torch_test, ddx_decode_torch_test = \
        calculate_first_and_second_derivative_with_torch(converted_stack,
                                                         sindy_autoencoder.ψ_decoder_to_x)

    # here we see the pytorch function is around 10% different to the results from numpy/tensorflow.
    print(
        'torch calculations\n------------------\ndx: {:.5f} (difference: {:.2%})\nddx: {:.5f} (difference: {:.2%})'
        .format(np.sum(dx_decode_torch_test.cpu().detach().numpy()),
                (np.sum(dx_decode_torch_test.cpu().detach().numpy())
                 - np.sum(recovered_autoencoder_network['dx_decode'])) /
                np.sum(recovered_autoencoder_network['dx_decode']),
                np.sum(ddx_decode_torch_test.cpu().detach().numpy()),
                (np.sum(ddx_decode_torch_test.cpu().detach().numpy())
                 - np.sum(recovered_autoencoder_network['ddx_decode'])) /
                np.sum(recovered_autoencoder_network['ddx_decode'])))

    dx_decode_np_test, ddx_decode_np_test = \
        calculate_first_and_second_derivative_with_np(recovered_autoencoder_network['v2_in_z'],
                                                      recovered_autoencoder_network['v2_in_dz'],
                                                      recovered_autoencoder_network['v2_in_sindy_predict'],
                                                      recovered_autoencoder_network['v2_in_decoder_weights'],
                                                      recovered_autoencoder_network['v2_in_decoder_biases'])

    # here we see the numpy function agrees with the results recovered from tensorflow.
    print(
        '\nNumpy calculations\n------------------\ndx: {:.5f} (difference: {:.2%})\nddx: {:.5f} (difference: {:.2%})'
        .format(np.sum(dx_decode_np_test),
                (np.sum(dx_decode_np_test)
                 - np.sum(recovered_autoencoder_network['dx_decode'])) /
                np.sum(recovered_autoencoder_network['dx_decode']),
                np.sum(ddx_decode_np_test),
                (np.sum(ddx_decode_np_test)
                 - np.sum(recovered_autoencoder_network['ddx_decode'])) /
                np.sum(recovered_autoencoder_network['ddx_decode'])))

    # here we see that the weight and bias values in the PyTorch model broadly match the saved
    # numpy source, but with a residual error in the last decoder layer. That layer has no sigmoid
    # activation; I believe this explains the difference, since a sigmoid compresses larger values
    # and so shrinks any discrepancy, while the final linear layer leaves it exposed.
    print('\n\nWeight and bias comparison for two models (imported from np source)\n')
    for title, label, attr, key in (('weights comparison:', 'l', 'weights', 'v2_in_decoder_weights'),
                                    ('bias comparison:', 'b', 'bias', 'v2_in_decoder_biases')):
        print(title)
        for i in range(len(recovered_autoencoder_network[key])):
            # sum of elementwise differences between the torch layer and the numpy source,
            # and that sum as a fraction of the numpy total
            diff = np.sum(getattr(sindy_autoencoder.ψ_decoder_to_x[i], attr).cpu().detach().numpy()
                          - recovered_autoencoder_network[key][i])
            base = np.sum(recovered_autoencoder_network[key][i])
            print('{}{} {:.5f} ({:.2%})'.format(label, i + 1, diff, diff / base))
        print()


if __name__ == '__main__':
    number_of_experiments = 1
    for i in range(number_of_experiments):
        run_experiment()