Transferring weights from Keras to PyTorch


I have a trained model in Keras (TensorFlow backend) and want to transfer its weights to a PyTorch model. When I do, the PyTorch model does not perform as well as the Keras model. Even the forward pass gives different outputs. To narrow the problem down, I created a small toy example to see whether the situation can be replicated. It can, easily, with the following script:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
import keras
import keras.backend as K
import torch
from torch import nn
import numpy as np
import random
import tensorflow as tf



def keras_to_pyt(km, pm):
    # Collect the Keras weights under PyTorch-style names, converted to PyTorch layouts.
    weight_dict = dict()
    for layer in km.layers:
        name = layer.get_config()['name']
        if isinstance(layer, keras.layers.Conv2D):
            # Keras kernel (kh, kw, in, out) -> PyTorch weight (out, in, kh, kw)
            weight_dict[name + '.weight'] = np.transpose(layer.get_weights()[0], (3, 2, 0, 1))
            weight_dict[name + '.bias'] = layer.get_weights()[1]
        elif isinstance(layer, keras.layers.Dense):
            # Keras kernel (in, out) -> PyTorch weight (out, in)
            weight_dict[name + '.weight'] = np.transpose(layer.get_weights()[0], (1, 0))
            weight_dict[name + '.bias'] = layer.get_weights()[1]
    pyt_state_dict = pm.state_dict()
    for key in pyt_state_dict.keys():
        pyt_state_dict[key] = torch.from_numpy(weight_dict[key])
    # Reassigning dict entries does not change the model; load them explicitly.
    pm.load_state_dict(pyt_state_dict)
    return pm

inp = np.random.normal(size=(1, 1, 5, 6)).astype(dtype=np.float32)

inp_pyt = torch.autograd.Variable(torch.from_numpy(inp.copy()).float())
inp_keras = np.transpose(inp.copy(), (0, 2, 3, 1))

a = keras.Input(shape=(5, 6, 1), name='input')
b = keras.layers.Conv2D(2, (3, 4), activation='linear', padding='same', name='conv_1', bias_initializer='random_uniform')(a)
keras_model = keras.models.Model(inputs=a, outputs=b)

class PyNet(nn.Module):
    def __init__(self):
        super(PyNet, self).__init__()
        self.conv_1 = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=(3, 4), padding=0)
    def forward(self, x):
        return self.conv_1(nn.ZeroPad2d((1, 2, 1, 1))(x))

pyt_model = PyNet()

keras_result = keras_model.predict(x=inp_keras, verbose=1)
pyt_model = keras_to_pyt(keras_model, pyt_model)
pyt_res = np.transpose(pyt_model(inp_pyt).data.numpy(), (0, 2, 3, 1))

for i in range(1):
    for j in range(5):
        for k in range(6):
            for l in range(2):
                print(keras_result[i, j, k, l], pyt_res[i, j, k, l])

The given script prints the following results:
(-0.044850744, -0.044850744)
(-0.006462127, -0.0064621493)
(0.017427735, 0.017427728)
(0.37132108, 0.37132108)
(-0.23686403, -0.23686403)
(-0.90041882, -0.90041882)
(-0.065025821, -0.065025814)
(0.23733595, 0.23733595)
(-0.11710706, -0.11710706)
(0.24237445, 0.24237445)
(-0.061176896, -0.061176896)
(-0.056127474, -0.056127474)
(0.11819147, 0.11819144)
(0.10230125, 0.10230123)
(0.60257965, 0.60257971)
(0.91219217, 0.91219229)
(-0.21988741, -0.21988741)
(-0.94492501, -0.94492507)
(-0.25544429, -0.25544426)
(0.80861402, 0.80861402)
(-0.32262391, -0.32262391)
(0.29116583, 0.2911658)
(-0.063009739, -0.063009739)
(-0.099875651, -0.099875651)
(0.34689391, 0.34689391)
(0.86204314, 0.86204308)
(-0.15171689, -0.1517169)
(0.54282308, 0.5428232)
(0.002491869, 0.0024918541)
(-0.43892303, -0.43892306)
(0.25317714, 0.25317714)
(-0.15906075, -0.15906072)
(-0.12131988, -0.12131988)
(0.27651906, 0.27651903)
(0.19103783, 0.19103783)
(-0.28911468, -0.28911468)
(-0.8152504, -0.8152504)
(0.62633103, 0.62633109)
(-0.70274156, -0.70274156)
(-0.22379526, -0.22379526)
(0.043730669, 0.043730669)
(-0.87990582, -0.87990582)
(-0.27177343, -0.27177346)
(-0.0016308948, -0.0016309395)
(0.48494944, 0.48494944)
(0.15391195, 0.15391195)
(-0.062737577, -0.062737577)
(-0.19160414, -0.19160414)
(0.13429669, 0.13429666)
(0.40462545, 0.40462548)
(0.65125835, 0.65125835)
(-0.49792019, -0.49792019)
(-0.1081684, -0.10816839)
(-0.20262283, -0.2026228)
(-0.37794596, -0.37794593)
(-0.21728748, -0.21728748)
(-0.33614561, -0.33614561)
(0.56259048, 0.56259054)
(0.090251423, 0.090251423)
(-0.32884693, -0.3288469)

As you can see, some of the values differ on the order of 10^-5. This could be a simple floating-point error, but it is consistent across every kind of model. If this happens with a single layer, the problem will simply magnify when the model is big and consists of multiple layers.
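One way to put a number on the mismatch, instead of reading it off the printed pairs, is to compare the arrays directly. The sketch below is only illustrative: it uses three pairs copied from the output above (in the real script the arrays would be `keras_result` and `pyt_res`), and the tolerance of 1e-6 is an assumption, not a rule. For these pairs the absolute differences are around 1e-8; it is the relative difference of the near-zero entries that reaches the 1e-5 range.

```python
import numpy as np

# Three (Keras, PyTorch) pairs copied from the printed output above.
keras_vals = np.array([-0.006462127, 0.002491869, -0.0016308948])
pyt_vals = np.array([-0.0064621493, 0.0024918541, -0.0016309395])

abs_diff = np.abs(keras_vals - pyt_vals)
rel_diff = abs_diff / np.abs(keras_vals)

print("max absolute difference:", abs_diff.max())
print("max relative difference:", rel_diff.max())

# atol=1e-6 is an assumed tolerance for float32 outputs of this magnitude.
within_tol = np.allclose(keras_vals, pyt_vals, rtol=0.0, atol=1e-6)
print("match within tolerance:", within_tol)
```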

Is there any way to avoid losing this floating-point accuracy while transferring weights? Or should I train the model in PyTorch from scratch?

A difference on the order of 1e-5 is very reasonable to expect in a forward pass.

The difference could be one of many things:

  • Keras not using cuDNN while PyTorch is?
  • Keras calling a different cuDNN kernel than PyTorch? (cuDNN has many algorithms: direct convolution, FFT, Winograd, etc.)
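Different convolution algorithms accumulate the same products in a different order, and floating-point addition is not associative, so two correct implementations can disagree in the last digits. A small NumPy sketch of that effect follows; the values are chosen to exaggerate it and have nothing to do with cuDNN itself.

```python
import numpy as np

big = np.float32(1e8)
one = np.float32(1.0)

# Near 1e8 the spacing between adjacent float32 values is 8,
# so adding 1.0 to `big` is rounded away entirely.
left_to_right = (big + one) - big

# Same three terms, different grouping: the 1.0 survives.
regrouped = (big - big) + one

print(left_to_right, regrouped)  # 0.0 1.0
```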

A 1e-5 difference is not a big deal.

If you really wish to bring the difference down even further, try:

torch.backends.cudnn.enabled = False

See if that helps…

As was already said, I don’t think the small rounding error is a big issue.
What do you mean by:

Is the accuracy worse in PyTorch than in Keras?
Which layer types are you using? Have you set your model to eval()?
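The `eval()` question matters for layers such as dropout and batch normalization, which behave differently in training and evaluation mode. A minimal sketch with `nn.Dropout`, chosen only for illustration (nothing here implies such a layer was in the model being discussed):

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

layer.train()   # training mode: elements are randomly zeroed, the rest scaled by 1 / (1 - p)
train_out = layer(x)

layer.eval()    # evaluation mode: dropout is the identity
eval_out = layer(x)

print(train_out)
print(torch.equal(eval_out, x))
```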

In my opinion, if the model was trained in Keras, it makes sense that directly transferring the parameters gives worse results. Intuitively, the parameters were trained to give a low loss in Keras, because all gradients were calculated based on Keras outputs. So in a sense they were optimized for the Keras functions, not the PyTorch ones.

Yes, transferring the weights from Keras to PyTorch does decrease the AUROC (which is my metric) by 0.05, which is big as far as I am concerned.
And yes, I have set the model to eval before evaluating.

The layers consist of ReLU, Conv2D, and linear.

You may be right. However, if that’s the case, I’ll have to train my model solely in PyTorch to get back the performance I am looking for.


torch.backends.cudnn.enabled = False

does decrease the variability between the numbers, but I still see some values that differ on the order of 1e-5. I understand this difference is small and negligible, but the error will propagate and explode across multiple layers, which is what is happening in the trained model.

Do you have evidence for this? A small test script?

In practice, float (single) precision only guarantees a relative accuracy of about 1e-6.
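For reference, NumPy reports the limits of single precision directly. This is just a quick check of the float32 format, not a statement about either framework:

```python
import numpy as np

info = np.finfo(np.float32)
print("machine epsilon:", info.eps)               # gap between 1.0 and the next float32
print("reliable decimal digits:", info.precision)

one = np.float32(1.0)
tiny = np.float32(info.eps / 4)

# An increment well below half the spacing is rounded away.
absorbed = (one + tiny) == one
print(absorbed)
```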

I will certainly build a test script to validate this, but I’ll need some time for that.

But in my current set-up, with the same input,
the model trained in Keras gives: 0.00405179336667
the model ported to PyTorch gives: 0.026722189039.

To thoroughly check whether my conjecture is right, I’ll closely inspect whether the inputs are exactly the same and create the test script accordingly.

Thank you for your prompt responses.

I’d check layer by layer. Usually there are subtle differences in how the weights are laid out in PyTorch vs TensorFlow.
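As an illustration of such layout differences, here is a NumPy-only sketch. The convolution permutation is the one used in the script above. The second half shows a pitfall that may or may not apply to a given model: when a dense layer follows a flattened convolution output, Keras (channels-last) and PyTorch (channels-first) flatten in a different order, so a plain transpose of the dense kernel is not enough. All shapes and names here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Conv2D: Keras kernel (kh, kw, in, out) -> PyTorch weight (out, in, kh, kw)
keras_kernel = rng.normal(size=(3, 4, 1, 2)).astype(np.float32)
torch_kernel = np.transpose(keras_kernel, (3, 2, 0, 1))
print(torch_kernel.shape)

# Dense after flatten: Keras flattens (h, w, c), PyTorch flattens (c, h, w).
c, h, w, units = 2, 5, 6, 3
feat_nchw = rng.normal(size=(1, c, h, w)).astype(np.float32)
feat_nhwc = np.transpose(feat_nchw, (0, 2, 3, 1))
dense_kernel = rng.normal(size=(h * w * c, units)).astype(np.float32)  # Keras (in, out)

keras_out = feat_nhwc.reshape(1, -1) @ dense_kernel

# Naive port: transpose only, input columns still in (h, w, c) order.
naive_weight = dense_kernel.T
naive_out = feat_nchw.reshape(1, -1) @ naive_weight.T

# Corrected port: reorder the input axis from (h, w, c) to (c, h, w) first.
fixed_weight = (dense_kernel.reshape(h, w, c, units)
                .transpose(3, 2, 0, 1)
                .reshape(units, -1))
fixed_out = feat_nchw.reshape(1, -1) @ fixed_weight.T

print("naive matches:", np.allclose(keras_out, naive_out, atol=1e-3))
print("fixed matches:", np.allclose(keras_out, fixed_out, atol=1e-3))
```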

Hi soumith,

Thank you for your prompt responses. I have traced the problem to how the input was being pre-processed before being fed into the model. There were subtle differences that were not easy to spot with the naked eye.
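The thread does not say what the pre-processing difference actually was. As a purely hypothetical illustration of how such a mismatch could be caught, the sketch below compares the tensors that reach each model after aligning their layouts. The helper name `max_input_mismatch` and the two scaling constants are invented for the example.

```python
import numpy as np

def max_input_mismatch(keras_input_nhwc, torch_input_nchw):
    """Largest absolute difference between the two inputs after aligning layouts."""
    aligned = np.transpose(torch_input_nchw, (0, 2, 3, 1))  # NCHW -> NHWC
    if aligned.shape != keras_input_nhwc.shape:
        raise ValueError(f"shape mismatch: {aligned.shape} vs {keras_input_nhwc.shape}")
    diff = aligned.astype(np.float64) - keras_input_nhwc.astype(np.float64)
    return float(np.max(np.abs(diff)))

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(1, 5, 6, 1)).astype(np.float32)

# Two hypothetical pipelines that differ only in the scaling constant.
keras_input = image / 255.0
torch_input = np.transpose(image / 256.0, (0, 3, 1, 2))

same = max_input_mismatch(keras_input, np.transpose(keras_input, (0, 3, 1, 2)))
different = max_input_mismatch(keras_input, torch_input)
print(same, different)
```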

Currently the results are exactly the same, even till the order of 1e-10.


Hi Tom,

Yes, this process certainly helped in identifying the problem. Thank you.

Do you know if it’s possible to use PyTorch weights with Keras?

It’s totally possible, though I’ve only tried it with convolutional and linear/dense layers. I have a small script that does this automatically when the models on the two platforms use the same naming scheme.
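That script is not reproduced in this thread. As a rough sketch of the reverse direction, assuming the same naming scheme as `keras_to_pyt` above and only Conv2D and Dense/Linear layers, the permutations are simply inverted. The helper name `pyt_to_keras_weights` and the plain dict of NumPy arrays standing in for a `state_dict` are assumptions for illustration.

```python
import numpy as np

def pyt_to_keras_weights(state, layer_name, kind):
    """Return [kernel, bias] in Keras layout for one layer.

    `state` maps names like 'conv_1.weight' to NumPy arrays in PyTorch layout.
    """
    weight = state[layer_name + '.weight']
    bias = state[layer_name + '.bias']
    if kind == 'conv2d':
        kernel = np.transpose(weight, (2, 3, 1, 0))  # (out, in, kh, kw) -> (kh, kw, in, out)
    elif kind == 'dense':
        kernel = np.transpose(weight, (1, 0))        # (out, in) -> (in, out)
    else:
        raise ValueError('unsupported layer kind: ' + kind)
    return [kernel, bias]

# Placeholder weights with made-up layer names and shapes.
state = {
    'conv_1.weight': np.zeros((2, 1, 3, 4), dtype=np.float32),
    'conv_1.bias': np.zeros(2, dtype=np.float32),
    'fc_1.weight': np.zeros((10, 60), dtype=np.float32),
    'fc_1.bias': np.zeros(10, dtype=np.float32),
}
conv_kernel, conv_bias = pyt_to_keras_weights(state, 'conv_1', 'conv2d')
dense_kernel, dense_bias = pyt_to_keras_weights(state, 'fc_1', 'dense')
print(conv_kernel.shape, dense_kernel.shape)
```

With real models, `state` could be built as `{k: v.cpu().numpy() for k, v in model.state_dict().items()}`, and each returned list passed to the matching Keras layer's `set_weights`.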

If you want, I’ll share it with you.

Sure, I’d really appreciate that. I have some PyTorch resnet50 weights that I’d like to move over to Keras so that I can use Keras.js.

I hope this helps:

For larger projects you may also want to look into this:

Much appreciated! I’ll let you know how it goes.

Hi Chirag, can you please elaborate on the issues you spotted, and how you were able to resolve these? Did you use MMdnn to do this? If not, how did you do the initial conversion? Many thanks!

Hi Siyu, they were mostly implementation differences between the train and test pipelines. Spotting them was difficult, but what pointed me towards checking some old code was a simple experiment:

  1. I used the script given above to transfer weights from a Keras model to a PyTorch model.
  2. Then from the PyTorch model back to the Keras model.
  3. I repeated this round trip more than 10000 times in a simple for loop.

If my conjecture that we lose accuracy while transferring weights were correct, all this moving of weights from one platform to the other would leave the model in an almost random state. But the result was that the only accuracy I lost was in the first iteration. That made me look at my implementation of the testing pipeline, where I found my bug.
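The idea behind that experiment can be sketched with NumPy alone. This is a stand-in under stated assumptions, not the original experiment, which moved weights between real Keras and PyTorch models: here only the layout permutations for one convolution kernel are applied back and forth, with a copy at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
keras_kernel = rng.normal(size=(3, 4, 1, 2)).astype(np.float32)  # Keras layout (kh, kw, in, out)

current = keras_kernel.copy()
for _ in range(10000):
    # Keras -> PyTorch layout, materialized as a fresh array
    torch_kernel = np.ascontiguousarray(np.transpose(current, (3, 2, 0, 1)))
    # PyTorch -> Keras layout
    current = np.ascontiguousarray(np.transpose(torch_kernel, (2, 3, 1, 0)))

# Permuting and copying never changes a value, so the round trip is bit-exact.
print(np.array_equal(current, keras_kernel))  # True
```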

No, I didn’t use MMdnn for the conversion, as my model wasn’t that complex and the naming scheme made it easy to write a script to transfer weights across the platforms. All one needs is exactly the same model on the two platforms. The difference lies in the orientation in which the two platforms store the tensors (for convolutions, does the height of the image come first, or the width?). By exploiting the naming scheme and the tensor orientations, I created a script to transfer weights: Transferring weights from Keras to PyTorch



I am trying to do the exact same thing 2 years later, and I have the same problem.

Keras version = 2.2.4
TensorFlow version = 1.11.0
PyTorch version = 1.0.0a0+ff608a9

I converted the Keras weights and loaded them into an equivalent PyTorch model, and the activity diverges quickly. I’m not sure where the problem is.

I see the same issue when I try the script/example notebook in the repo:

Any suggestions/tips will be highly appreciated.