Loss of result precision from function converted from numpy/TFv1 to PyTorch

Simon_Watson · August 16, 2022, 3:24am

I am trying to move a model from TF1 to Torch. The model is quite involved and I have been unable to get a portion of it to work. In particular, I have found that a function appears to return a result in PyTorch that is around 10% off the result of the equivalent function in Tensorflow or Numpy.

I believe that this 10% difference is an error that impacts my loss function and prevents the model from learning.

I have isolated the function here and show both the torch and numpy ‘equivalents’. Attached is a link to the torch model and the comparison data needed. Below are two code segments. I believe the Numpy result is the better one because it agrees with the Tensorflow v1 result to an accuracy of 10e-05 and, in the model I’m dealing with, this function trains successfully when the Torch equivalent does not.

My question is in two parts:

How come the Numpy function returns better results than the Torch function, and
is there a way of arranging the Torch function so that its accuracy is closer to the Numpy function?

Regards,

Simon

The data needed to run this review is saved here:
https://drive.google.com/file/d/1lClIUWuHDGtibSXN2h5X-cyMaalU-cbX/view?usp=sharing

The full torch model is saved in a pickle for use with torch.load:
https://drive.google.com/file/d/1bFJYC5bHme7YmIbqTOjaxXvd-yrKczxH/view?usp=sharing

The data load and two functions:

import pickle
from typing import Dict, Any
import numpy as np
import torch

with open('recovered_autoencoder_network.pkl', 'rb') as f:
    recovered_autoencoder_network = pickle.load(f)

# parameters needed for this issue
params: Dict[str, Any] = {'weight_precision': torch.float64,
                          'sindy_precision': torch.float64,
                          'target_device': 'cuda'}

sindy_autoencoder = torch.load('saved_model.pkl')
sindy_autoencoder.to(params['target_device'])

# this is a version of the 'problem' function in torch.
def calculate_first_and_second_derivative_with_torch(input_and_derivatives, stack):

    x, dx, ddx = input_and_derivatives

    layer_count = len(stack)

    for i in range(layer_count - 1):
        x = torch.mm(x, stack[i].weights) + stack[i].bias
        x = torch.sigmoid(x)
        dx_prev = torch.mm(dx, stack[i].weights)
        sigmoid_first_derivative = torch.mul(x, 1 - x)
        sigmoid_second_derivative = torch.mul(sigmoid_first_derivative, 1 - 2 * x)
        dx = torch.mul(sigmoid_first_derivative, dx_prev)
        ddx = torch.mul(sigmoid_second_derivative, torch.square(dx_prev)) \
              + torch.mul(sigmoid_first_derivative, torch.mm(ddx, stack[i].weights))
    dx = torch.mm(dx, stack[layer_count - 1].weights)
    ddx = torch.mm(ddx, stack[layer_count - 1].weights)

    return dx, ddx

# this is the equivalent 'problem' function in numpy.
def calculate_first_and_second_derivative_with_np(input, dx, ddx, weights, biases):
    dz = dx
    ddz = ddx

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    for i in range(len(weights) - 1):
        input = np.matmul(input, weights[i]) + biases[i]
        input = sigmoid(input)
        dz_prev = np.matmul(dz, weights[i])
        sigmoid_derivative = np.multiply(input, 1 - input)
        sigmoid_derivative2 = np.multiply(sigmoid_derivative, 1 - 2 * input)
        dz = np.multiply(sigmoid_derivative, dz_prev)
        ddz = np.multiply(sigmoid_derivative2, np.square(dz_prev)) \
              + np.multiply(sigmoid_derivative, np.matmul(ddz, weights[i]))
    dz = np.matmul(dz, weights[-1])
    ddz = np.matmul(ddz, weights[-1])

    return dz, ddz

dx_decode_np_test, ddx_decode_np_test = \
    calculate_first_and_second_derivative_with_np(
        recovered_autoencoder_network['v2_in_z'],
        recovered_autoencoder_network['v2_in_dz'], 
        recovered_autoencoder_network['v2_in_sindy_predict'], 
        recovered_autoencoder_network['v2_in_decoder_weights'],
        recovered_autoencoder_network['v2_in_decoder_biases'])

# Here I access the tensors recovered from the saved Tensorflow model and convert them to torch.
converted_stack = [torch.tensor(recovered_autoencoder_network[key],
                                device=torch.device(params['target_device']),
                                dtype=params['sindy_precision'])
                   for key in ('v2_in_z', 'v2_in_dz', 'v2_in_sindy_predict')]

# Here I use the tensors captured from the tensorflow model (converted to torch)
# with the torch version of the function and the layers from the model. 
dx_decode_torch_test, ddx_decode_torch_test = \
    calculate_first_and_second_derivative_with_torch(converted_stack,
        sindy_autoencoder.ψ_decoder_to_x) 

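# Here I show the error between the two functions.
# The torch outputs live on the GPU and track gradients, so convert them to numpy first.
print(dx_decode_np_test - dx_decode_torch_test.detach().cpu().numpy(),
      ddx_decode_np_test - ddx_decode_torch_test.detach().cpu().numpy())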

# Here I show that the Torch weights in the model feeding the Torch 
# function are equivalent to the Numpy arrays feeding the Numpy function. 
# (the weights were initialized from those arrays after conversion to Torch.tensor.)
def summed_difference(parameter, reference):
    # Sum of (torch parameter - numpy reference), and that sum as a fraction of the summed reference.
    difference = np.sum(parameter.cpu().detach().numpy() - reference)
    return difference, difference / np.sum(reference)

decoder_layers = sindy_autoencoder.ψ_decoder_to_x
reference_weights = recovered_autoencoder_network['v2_in_decoder_weights']
reference_biases = recovered_autoencoder_network['v2_in_decoder_biases']

print("\n\nWeight and bias comparison for two models (imported from np source)\n")
print("weights comparison: ")
for i in range(4):
    print("l{} {:.5f} ({:.2%})".format(
        i + 1, *summed_difference(decoder_layers[i].weights, reference_weights[i])))
print("\nbias comparison: ")
for i in range(4):
    print("b{} {:.5f} ({:.2%})".format(
        i + 1, *summed_difference(decoder_layers[i].bias, reference_biases[i])))
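
For reference, the recurrence that both functions above appear to implement can be written out as follows. The notation is an assumption made here for illustration: $a_i$ is the activation after hidden layer $i$ (with $a_0$ the input), $W_i$ and $b_i$ are that layer's weights and bias, dots denote the first and second derivatives passed in as `dx`/`dz` and `ddx`/`ddz`, and $\odot$ is the elementwise product.

```latex
\begin{aligned}
a_i &= \sigma\!\left(a_{i-1} W_i + b_i\right), \\
\sigma'_i &= a_i \odot (1 - a_i), \qquad
\sigma''_i = \sigma'_i \odot (1 - 2 a_i), \\
\dot{a}_i &= \sigma'_i \odot \left(\dot{a}_{i-1} W_i\right), \\
\ddot{a}_i &= \sigma''_i \odot \left(\dot{a}_{i-1} W_i\right)^{2}
            + \sigma'_i \odot \left(\ddot{a}_{i-1} W_i\right), \\
\dot{x} &= \dot{a}_{n-1} W_n, \qquad
\ddot{x} = \ddot{a}_{n-1} W_n .
\end{aligned}
```

The square is elementwise, and the final layer $n$ is linear, so no activation terms appear in the last line. Both the torch and the numpy function follow these steps in the same order, so a 10% gap is unlikely to come from the formula itself.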

ptrblck · August 16, 2022, 4:19am

Could you post more information about your system, used PyTorch version etc.?

Simon_Watson · August 16, 2022, 4:56am

Hi @ptrblck -
I am using Ubuntu 22.04 and an anaconda environment. My GPU is an NVIDIA RTX A4500. I’m using a custom build for a uni project and if it helps, I have Ubuntu Advantage support if there is a chance this is an issue with the OS. (Although I remain confident the eventual diagnosis will be ‘user error’. Just keen to find out which error I need to correct)

I have attached a screenshot from my NVIDIA server settings and my Ubuntu settings. Below are the Pytorch details I get from my python environment. I also use PyTorch to write my code.

Simon

(pytorch) simon@infodynamics001:~$ python3
Python 3.9.12 (main, Apr  5 2022, 06:56:58) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
1.11.0
>>> torch.version.cuda
'11.3'
>>>

ptrblck · August 16, 2022, 4:57am

Could you update to 1.12.0 or disable TF32 via torch.backends.cuda.matmul.allow_tf32 = False and recheck the results?
As of 1.12.0, TF32 is disabled by default for matmuls, as it was confusing users in exactly the use cases you are describing here.
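
A minimal sketch of the suggested check, assuming the two backend flags below are the relevant ones. TF32 applies to float32 matrix multiplications on Ampere and newer NVIDIA GPUs; float64 tensors are not affected by it. The matrix sizes and seed are arbitrary choices for illustration, and the script falls back to the CPU when no GPU is present.

```python
import torch

# Disable TF32 for cuBLAS matmuls and for cuDNN ops.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

# Use the GPU when available; the comparison also runs on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

torch.manual_seed(0)
a = torch.randn(256, 256, device=device)  # float32 by default
b = torch.randn(256, 256, device=device)

# float64 product as the reference, float32 product as the value under test.
reference = a.double() @ b.double()
result = a @ b

max_abs_error = (result.double() - reference).abs().max().item()
print("max abs error of float32 matmul vs float64 reference:", max_abs_error)
```

Running the same script with the flag set to True on an Ampere GPU and comparing the printed error is one way to see whether TF32 plays any part in a discrepancy.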

Simon_Watson · August 16, 2022, 5:18am

Thanks @ptrblck,
It’s installing now using the below on a fresh conda environment.

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

I will advise once I’ve rerun.
Simon

Simon_Watson · August 16, 2022, 6:19am

Hi @ptrblck,
I’m afraid the error still remains. However, although the math has been run on a function defined using 1.12.0 and with the backends setting you advised, it accesses the bias and weights from a model saved prior to that change. That model was only initialised (so no matmuls), but the initialisation was to the numpy values used here via a conversion to torch.tensor. In particular, I note the 4th layer has an aggregate sum of weight values that is 1.92% different from the Numpy values it was supposed to have been initialised with. I’ve included the code I use to initialise the final layer here (there were other initialisation methods, but since they aren’t triggered, I deleted them). The Numpy tensor it is initialised with is the one that gets used in the comparison I sent previously. Could the previous issue have impacted the initialisation as well? I’d be happy to save the model again with the new PyTorch and the extra cuda.matmul line. (I think that was optional after the PyTorch upgrade but I thought it wouldn’t harm anything.)

I confirm that:
[*params['autoencoder_weight_initialization']][0] == 'state_dictionary'
params['target_device'] == 'cuda'
and
params['weight_precision'] == torch.float64

Simon


import numpy as np
import torch
from torch import nn, empty

class SINDYAutoencoderFinalLayer(nn.Module):
    """
    Custom layer performs matrix multiplication on three separate activation state channels. 
    dz and ddz do not have an associated bias. 
    The channels are separated since each requires a different treatment based on the same 
    set of weights used for each.
    """

    def __init__(self, size_in, size_out, params, ref):
        super().__init__()
        self.ref = ref
        self.size_in, self.size_out = size_in, size_out

        # this parameter determines whether the model learns the first or second derivative with SINDY.
        self.model_order = params['model_order']

        # now initialise weights and bias.
        weights = empty((size_in, size_out), device=torch.device(params['target_device']),
                        dtype=params['weight_precision'])
        bias = empty(size_out, device=torch.device(params['target_device']),
                     dtype=params['weight_precision'])

        # {'state_dictionary': None}
        if [*params['autoencoder_weight_initialization']][0] == 'state_dictionary':
            weights = torch.tensor(params['state_dictionary'][[*ref][0] + '_W' + str([*ref.values()][0]) + ':0'],
                                   device=torch.device(params['target_device']),
                                   dtype=params['weight_precision'])
            bias = torch.tensor(params['state_dictionary'][[*ref][0] + '_b' + str([*ref.values()][0]) + ':0'],
                                device=torch.device(params['target_device']),
                                dtype=params['weight_precision'])

        self.weights = nn.Parameter(weights)
        self.bias = nn.Parameter(bias)

    def forward(self, x_base):
        """
        This method describes how the variables in the layer are manipulated by the Forward process. 
        Note that the element pushed through this process is the base 'z' from data. 
        The first (and second) derivatives are not used.
        """

        # apply weights and bias of this layer to activation states from the last layer
        # Note that for this final layer, no activation function is applied.
        return torch.mm(x_base, self.weights) + self.bias

ptrblck · August 16, 2022, 6:35am

That’s interesting as it would point towards the initialization itself. Are you using the default float64 dtype in numpy and are transforming it then to float32 in PyTorch? How are you measuring the relative error of these tensors?
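
On the question of how the error is measured: the comparison in the first post sums the signed differences, which can let positive and negative errors cancel, and divides by a summed reference that may be close to zero. A minimal sketch of elementwise alternatives follows; the helper name `compare_arrays` and the toy arrays are hypothetical and only there to illustrate the point.

```python
import numpy as np

def compare_arrays(candidate, reference):
    """Return a few error measures between two arrays of the same shape."""
    candidate = np.asarray(candidate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    difference = candidate - reference
    return {
        # Signed sum: errors of opposite sign cancel each other.
        "sum_of_differences": float(difference.sum()),
        # Largest single elementwise deviation.
        "max_abs_difference": float(np.abs(difference).max()),
        # Size of the error relative to the size of the reference.
        "relative_l2_error": float(np.linalg.norm(difference) / np.linalg.norm(reference)),
    }

# Toy data: every element is off by 0.5, yet the signed sum hides it.
reference = np.array([1.0, -1.0, 2.0, -2.0])
candidate = reference + np.array([0.5, -0.5, 0.5, -0.5])

report = compare_arrays(candidate, reference)
for name, value in report.items():
    print(name, value)
```

For torch tensors, the same helper can be used after `.detach().cpu().numpy()`.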

Simon_Watson · August 16, 2022, 10:42am

Hi @ptrblck,

I convert a float32 in numpy to float64 in PyTorch. I know this is the wrong way around, but the float32 in TF/Numpy is good enough to avoid producing an ‘error’ and allows the model to produce gradients well enough to train. If I use TF versions of the function (outputs provided earlier in the chain), I get the same result as the Numpy result. This is what I compare to with the numpy function. When I use the Torch versions of the functions I get a sum of weights and biases that is 10% higher and ‘wrong’, so the model is not able to produce ‘trainable’ gradients for the parts fed by that function.

Comparing the Torch and TF models with the trainable tensors initialised from a ‘mid train’ Tensorflow model, I see agreement to around 10e-04 or 10e-05. However, for this specific set (the decoder section of an autoencoder) the outputs are around 10% out (i.e. 10e-01 compared to the above).

If you have any ideas how I might continue to troubleshoot this, I’d be keen to hear. I tried a gradient comparison but had enough issues to warrant a separate post on the site.

Thanks and regards,

Simon
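
One way to troubleshoot the derivative function in isolation is to check it against finite differences, independently of either framework. The sketch below is numpy-only and rests on assumptions: the layer sizes and random data are hypothetical stand-ins for the real decoder, and `dz`/`ddz` are treated as the first and second derivatives of `z` along a path `z(t)`. The same check could be repeated with the torch function and the real weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(z, weights, biases):
    """Sigmoid hidden layers followed by a linear output layer."""
    for w, b in zip(weights[:-1], biases[:-1]):
        z = sigmoid(z @ w + b)
    return z @ weights[-1] + biases[-1]

def propagate_derivatives(z, dz, ddz, weights, biases):
    """Same recurrence as the numpy function in the first post."""
    for w, b in zip(weights[:-1], biases[:-1]):
        z = sigmoid(z @ w + b)
        dz_prev = dz @ w
        s1 = z * (1 - z)        # first derivative of the sigmoid
        s2 = s1 * (1 - 2 * z)   # second derivative of the sigmoid
        dz, ddz = s1 * dz_prev, s2 * np.square(dz_prev) + s1 * (ddz @ w)
    return dz @ weights[-1], ddz @ weights[-1]

# Hypothetical layer sizes and random data, standing in for the real decoder.
rng = np.random.default_rng(0)
sizes = [3, 8, 8, 5]
weights = [0.5 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [0.1 * rng.standard_normal(n) for n in sizes[1:]]
z, dz, ddz = (rng.standard_normal((4, sizes[0])) for _ in range(3))

def path(t):
    # Assumed local trajectory with z(0) = z, z'(0) = dz, z''(0) = ddz.
    return decode(z + t * dz + 0.5 * t**2 * ddz, weights, biases)

# Central finite differences in t, evaluated at t = 0.
h = 1e-3
dx_fd = (path(h) - path(-h)) / (2 * h)
ddx_fd = (path(h) - 2 * path(0.0) + path(-h)) / h**2

dx, ddx = propagate_derivatives(z, dz, ddz, weights, biases)
print("max |dx - dx_fd|  :", np.abs(dx - dx_fd).max())
print("max |ddx - ddx_fd|:", np.abs(ddx - ddx_fd).max())
```

If the analytic and finite-difference values agree for a given set of weights, the recurrence is sound for those weights, and any remaining gap between frameworks is more likely to come from the inputs or parameters being fed in.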

ptrblck · August 17, 2022, 5:25am

If I understand the use case correctly you are not training the models at all yet, but are only comparing the loaded parameters and are seeing a larger mismatch in a specific layer. If that’s the case, are you sure you are assigning the right values to the parameters of this particular layer and are not just randomly initializing it?
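
A minimal sketch of such a check, assuming the same conversion pattern as the layer code earlier in the thread (a float32 numpy array turned into a float64 torch parameter). The array here is a hypothetical stand-in for one entry of the saved TensorFlow weights. Widening float32 to float64 is exact, so a correctly assigned parameter should match its source element for element.

```python
import numpy as np
import torch

# Hypothetical float32 array standing in for one saved TensorFlow weight matrix.
source = np.random.default_rng(0).standard_normal((4, 3)).astype(np.float32)

# float32 numpy -> float64 torch parameter, as in the layer initialisation.
parameter = torch.nn.Parameter(torch.tensor(source, dtype=torch.float64))

# Bring the parameter back to numpy and compare element by element.
round_trip = parameter.detach().cpu().numpy()
reference = source.astype(np.float64)

max_abs_difference = np.abs(round_trip - reference).max()
values_match = np.array_equal(round_trip, reference)
print("exact match:", values_match, "max abs difference:", max_abs_difference)
```

Applied to each layer of the loaded model, a non-zero difference would point at the assignment (wrong key, wrong layer, or a random initialisation left in place) rather than at numerical precision.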

Simon_Watson · August 17, 2022, 6:07am

Hi @ptrblck ,
I have trained both the Tensorflow model and the Torch version. These investigations are because on the Torch one, I get a fail. The loss calculation involves a number of terms. As far as I can see, the term that depends on the function I’ve mentioned previously isn’t improving over the cycle and eventually comes to dominate the overall loss.
On the parameter initialisation: I initialise both the TF and Torch weights and biases with the Numpy data. Then I run through the forward process for the TF and Torch models. I compare the weight and bias aggregate sum for each of them, as well as the output of the additional functions used. Broadly speaking, I have an autoencoder process followed by some additional functions that generate additional losses.
The sum of the last layer of the encoding portion of the autoencoder for TF and Torch agree so that TF/Torch gives 1.0034. The other function output when taken as a ratio for TF output/Torch output gives us 0.9999. The function I suspect gives a ratio of roughly 1.10 between the two approaches.
I’m now trying to view the computational graph as a whole between two models to see if there are any differences but I’ve hit another issue (posted separately).

Simon

Simon_Watson · August 17, 2022, 6:28am

Hi @ptrblck,
Attached is a view of my loss components through the training cycle for my Torch model. As you can see, 3 of the 4 do show improvement. However, the green line (sindy_x) is unchanged. This is because it depends on the output of the function I talked about, and the loss is calculated by taking the difference between the input values (which are large) and the output of the trained function. Since that output is close to zero and doesn’t change, the loss is just the sum of the input.
Thanks for your interest.
Simon

ptrblck · August 19, 2022, 6:07am

Sorry, but I don’t understand the full use case and would need to check the code.
Could you post the input shapes which would be needed to execute the code and point me to the calculation or output which shows the large difference?

Simon_Watson · August 21, 2022, 9:04am

Hi @ptrblck,
I’m not being clear - I tried to balance giving needed detail against making things too complicated and I failed.

I will write a script to illustrate and add to this thread by Tuesday.

If I use the numpy function, I get an answer that agrees with a tensorflow function.

If I initialise a torch sequence to the values of that set of numpy arrays, then select the equivalent torch arrays from that sequence to use instead of the numpy arrays in a torch function that is the equivalent of the numpy function, I get an answer that is 10% different. When I compare the tensorflow equivalent to numpy, it agrees. This is true even when the torch arrays are set to float64 and tensorflow is float32.

I’ll do that script by Tuesday.

Simon