Suboptimal convergence when compared with TensorFlow model

I know this is a long shot, but did you ever get around to testing this? I’m having similar issues reproducing TensorFlow results with PyTorch, and I currently suspect that it’s due to the Adam optimizer.

I think weight initialization is a big issue here.
I tested the same implementation in both PyTorch and Keras and found that I ran into a suboptimal-convergence problem.

I once accidentally implemented the Keras version in a different way (I don’t remember how I implemented it at the time) and got a suboptimal-convergence issue on Keras.

So I concluded that it was the weight initialization that caused the suboptimal convergence.
This may apply to PyTorch as well.

@tom Hi, have you found out the real problem? I have a similar issue as well, and I don’t believe in accidents.

Well, I converted a few models (e.g. StyleGAN, in joint work with @ptrblck) with the above method, and it works very well for me.

It should be said that “I ported my program to X and now it’s not doing the same as before” isn’t a single thing, just as (you brought up accidents) “there was a car crash” doesn’t have a unique cause that you can find and solve.

Some people are very good at getting these things to work similarly across frameworks (see e.g. HuggingFace’s awesome transformer work), but it is a skill you have to build.

Best regards

Thomas

The same.

I set optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.2) for PyTorch, and luckily the performance is comparable with TensorFlow.

And I do not know the reason.
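For reference, a minimal sketch of how such a scheduler plugs into a training loop; the model, data and loss here are placeholders, not from the post above:

import torch

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.2)

for epoch in range(200):
    inputs = torch.randn(32, 10)   # placeholder batch
    targets = torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # multiplies the learning rate by 0.2 every 50 epochs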

Same issue here with a custom layer made in TF and ported to Pytorch

In my experiment, however, I followed these two steps and ended up with similar results:

  • Used nn.init.xavier_uniform_ for the weights and nn.init.constant_ for the biases.
  • In the Adam optimizer, PyTorch uses a default eps=1e-8 vs TensorFlow’s epsilon=1e-7; I changed it to 1e-7 (see the sketch below).
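A minimal sketch of both points combined; the model and the layer types checked are just illustrative:

import torch
import torch.nn as nn

def init_weights(m):
    # Xavier-uniform weights, constant (zero) biases, as in the first point
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.constant_(m.bias, 0.0)

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Flatten(), nn.Linear(16 * 30 * 30, 10))
model.apply(init_weights)

# second point: match TensorFlow's default epsilon instead of PyTorch's 1e-8
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-7)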

Hope this helps

If you implement the Adam optimizer in raw NumPy and just use the TF2.x and PyTorch frameworks to compute the network outputs, loss and gradients (using the same hyperparameters, the same Xavier uniform initialization for the weights and zero initialization for the biases), you still get slightly better performance (faster convergence) using TF2.x than PyTorch. So I think something else is going on beyond a problem with the optimizer, weight initialization or differences in hyperparameters.
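For reference, a minimal NumPy Adam step along the lines described above; the framework only has to supply the gradient, the update itself is framework-independent (variable names are the usual Adam ones, not taken from any particular code base):

import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # standard Adam update with bias correction; t is the 1-based step count
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v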

I converted a TensorFlow model into a PyTorch model, fixed the weights and the input batch, and I am getting the same loss value. But for this same loss I am getting different gradients. I created a new topic here with a short reproducible code.
Could you please take a look?

One quick question: is it not guaranteed that for the same loss, the gradient will be the same? If not, what are the other things that I should check?

If the loss were (e.g. continuously) differentiable and any input were valid, you would expect to get the same gradient up to numerical precision.
In practice, loss functions around deep learning have different ways of dealing with non-differentiable points, e.g. the 0 point of ReLU.
Fancy people will say that there, any subgradient can be defended. One of the cases I remember is that CuDNN uses gradient 1 at 0 in its RNN+ReLU implementation, while you would usually not do this, and in most other places libraries set the gradient to 0 at the 0 point. Quite likely this wasn’t as much of a standard yet when the function was implemented in CuDNN. As things are then passed through many layers, the gradients may differ.
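To make the ReLU example concrete, a quick check (this is plain CPU autograd, not the CuDNN RNN path mentioned above):

import torch

# PyTorch's elementwise ReLU uses the subgradient 0 at the kink x = 0
x = torch.tensor([0.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0.])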

Best regards

Thomas

Thanks for all the useful comments above. I took the initiative to bump this issue after spending a couple of days on reproduction attempts of CycleGAN-VC. CycleGAN-VC seems not to be reproducible due to Adam optimizer/sparsity differences, according to this issue:

I’m very unfamiliar with TF, but I will do my best to get a minimal reproducible example with the two frameworks.

I spent a bit of time with it, but I struggled to implement two identical matrix multiplications in TF and PyTorch: even though I’m loading the weights initialised by PyTorch into TF, with the same input, the mean of the feedforward output seems to be different (0.003 vs -0.006).
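For anyone who wants to reproduce the comparison, this is roughly the kind of check I mean (shapes and seeds are illustrative, not the ones from my experiment):

import numpy as np
import tensorflow as tf
import torch

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 64)).astype(np.float32)   # shared weight matrix
x = rng.standard_normal((32, 128)).astype(np.float32)   # shared input

out_torch = (torch.from_numpy(x) @ torch.from_numpy(w)).numpy()
out_tf = tf.matmul(tf.constant(x), tf.constant(w)).numpy()

print(out_torch.mean(), out_tf.mean())
print(np.abs(out_torch - out_tf).max())  # expected to be around float32 precision, ~1e-6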


To avoid being the guy who just says “ok solved, it was a stupid bug”, I want to note here that in my case the main problem was not the difference in the Adam optimiser, but the fact that I used a PixelShuffler layer ported from TF code. It turns out tf.reshape is not identical in behaviour to torch.reshape; that’s what I concluded.

I ran some experiments as mentioned above, but I couldn’t isolate the problem. Initialising with the same weights and running 1000 epochs seems to lead to different results between the two frameworks for me (even with SGD). Unfortunately I don’t have this code anymore and I’m a bit busy atm, but I encourage people to try this experiment with different optimisers. There is a behaviour difference; whether it’s significant for a particular application is a different question.

If someone finds otherwise (i.e. the same matrix multiplication initialised to the same weights (not only the same scheme), the same input/output, MSE loss, Adam, different frameworks, and the same results), then I’m willing to dig up my old code and help, but I don’t really see much incentive to address this (sadly).

I had spent days trying to solve this problem. Finally, I was able to align my loss values (exactly the same). Follow these steps:

1. Set random seeds (NumPy, TF and PyTorch) to ensure consistent results while running your models (see the sketch below).
2. Start with a small data sample (say, 5 training and test data points).
3. Turn off shuffling. Note that some iterators have shuffle enabled by default; explicitly turn it off.
4. Set exactly the same hyperparameters for TF and PyTorch.
5. Make sure exactly the same preprocessing is applied in both cases.
6. Set batch_size = 1 and number of epochs = 1. Make sure the same data is being fed (to both PyTorch and TF) in the same order.
7. The number of model parameters should be the same.
8. Turn off dropout. Dropout randomly switches off neurons in the model; remove it.
9. Use SGD (for PyTorch) and GradientDescentOptimizer (for TF) as optimizers. NOTE: the Adam optimizer is implemented slightly differently in TF and PyTorch and can cause differences in losses.

The losses should be aligned (exactly the same).
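For step 1, a minimal seed-pinning sketch (assuming TF 2.x and a reasonably recent PyTorch; TF 1.x uses tf.set_random_seed instead):

import random
import numpy as np
import tensorflow as tf
import torch

SEED = 0
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)          # TF 2.x
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)  # only relevant if running on GPU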

Thank you everyone for the helpful ideas. The first bug I found was that, given the different axis order in convolutional networks, the reshape function will produce different results. For example

[[1, 2, 3], [4, 5, 6]] 

would become

[1, 2, 3, 4, 5, 6]

versus

[1, 4, 2, 5, 3, 6].
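A small NumPy illustration of the two orderings; the transpose stands in for the NHWC vs NCHW axis difference:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.reshape(-1))    # [1 2 3 4 5 6]
print(a.T.reshape(-1))  # [1 4 2 5 3 6]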

Even after fixing this, though, the PyTorch model does not converge as well as the TF1 model. The DNN has multiple convolutional and transposed convolutional layers. I did the following:

1. Save the TF1 weights to file and load them into PyTorch, so the parameters are identical
2. Use the same loss and Adam optimizer hyperparameters
3. Pre-process a single data sample and save it to file
4. Load the sample into the program, compute the forward and backward pass, save the data and gradient at each layer
5. Find that the data and gradients look identical in plots, and that the difference ranges from 1e-6 to 1e-10 (i.e. they are identical within the bounds of numerical precision); see the sketch after this list
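For steps 4 and 5, this is roughly how the comparison can be done on the PyTorch side; the file names and the assumption that the TF gradients were already transposed to the PyTorch layout are hypothetical:

import numpy as np

def compare_grads(model, tf_grad_dir):
    # one .npy file per parameter, saved from the TF1 run under matching names
    for name, param in model.named_parameters():
        tf_grad = np.load(f"{tf_grad_dir}/{name}.npy")
        pt_grad = param.grad.detach().cpu().numpy()
        print(name, np.abs(pt_grad - tf_grad).max())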

As far as I know, the two dataloaders are identical. Just to be sure, I’ll try training both models for a few epochs on a single data sample to check whether they converge the same way.

Any thoughts on what to try next would be helpful! Did most people on this thread give up, or fix their problem?

Hi everyone, I checked the following:

  1. Loaded identical weights from file to initialize the PyTorch and TF models. Loaded the same data sample from numpy into both programs. Ran a single forward-backward pass with the Adam optimizer. The intermediate layer outputs and gradients were within 1e-6 to 1e-10 of each other. Plots looked identical (after transposition).
  2. Loaded identical weights from file to initialize the PyTorch and TF models. Trained and tested, always loading the same data sample. Used 1000 iterations for training and 1 for testing, with batch size 1. Used SGD instead of Adam. Losses were identical.
# PyTorch
Finished debug testing - MSE: 0.1504615843296051
Finished debug testing - MSE: 0.10858417302370071
Finished debug testing - MSE: 0.08603279292583466
# TensorFlow
Finished debug testing - Mean MSE: 0.15046157
Finished debug testing - Mean MSE: 0.108584
Finished debug testing - Mean MSE: 0.08603277
  3. Did everything exactly the same as in 2., except switching to the Adam optimizer. PyTorch performs worse.
# PyTorch:
Finished debug testing - MSE: 0.0031117501202970743
Finished debug testing - MSE: 0.0020642257295548916
Finished debug testing - MSE: 0.0019268309697508812
Finished debug testing - MSE: 0.0016333406092599034
Finished debug testing - MSE: 0.0017334128497168422
Finished debug testing - MSE: 0.0014430736191570759
Finished debug testing - MSE: 0.0010424457723274827
Finished debug testing - MSE: 0.0012145100627094507
Finished debug testing - MSE: 0.0011195113183930516
Finished debug testing - MSE: 0.0009501167223788798
Finished debug testing - MSE: 0.0009987876983359456
Finished debug testing - MSE: 0.0007953296881169081
Finished debug testing - MSE: 0.00075263757025823
Finished debug testing - MSE: 0.0008374055614694953
Finished debug testing - MSE: 0.000735406531020999
# TensorFlow:
Finished debug testing - Mean MSE: 0.0036667113
Finished debug testing - Mean MSE: 0.0032563617
Finished debug testing - Mean MSE: 0.0021536187
Finished debug testing - Mean MSE: 0.0015266595
Finished debug testing - Mean MSE: 0.0013580231
Finished debug testing - Mean MSE: 0.0013878695
Finished debug testing - Mean MSE: 0.0011856346
Finished debug testing - Mean MSE: 0.0011136091
Finished debug testing - Mean MSE: 0.00091276
Finished debug testing - Mean MSE: 0.000890126
Finished debug testing - Mean MSE: 0.00088381825
Finished debug testing - Mean MSE: 0.0007283067
Finished debug testing - Mean MSE: 0.00081382995
Finished debug testing - Mean MSE: 0.0006670901
Finished debug testing - Mean MSE: 0.00046282331

Details:
TF 1.15.3.

adam_optimizer = tf.train.AdamOptimizer(learning_rate=5e-5)

# default parameters from the documentation at https://github.com/tensorflow/tensorflow/blob/v1.15.0/tensorflow/python/training/adam.py#L32-L235:
# learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, use_locking=False, name="Adam")

PyTorch 1.8.1.
torch.optim.Adam(params=model.parameters(), lr=5e-5, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0)

Model:
This model is fully convolutional and contains TransposedConv2D layers.

def train(...):
    ...
    checkpoint = torch.load(checkpoint_file, map_location=device)
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    ...
    counter = 0
    while run:
        counter += 1
        if counter > 1000:
            break

        # `in` is a Python keyword, so the input tensor is called `inp` here
        inp = np.load("debug_data/in.npy")
        out1 = np.load("debug_data/out1.npy")
        out2 = np.load("debug_data/out2.npy")

        # adjust the axis layout coming from TF
        inp = inp.squeeze(3)
        inp = np.expand_dims(inp, axis=0)
        # ... do the same for out1 and out2

        inp, out1, out2 = \
            torch.from_numpy(inp).to(device), \
            torch.from_numpy(out1).to(device), \
            torch.from_numpy(out2).to(device)

        optimizer.zero_grad()
        out1_hat, out2_hat = model(inp)

        train_loss = loss_fn(out1_hat, out1) + loss_fn(out2_hat, out2)
        train_loss.backward()

        optimizer.step()

    save_checkpoint({'state_dict': model.state_dict(),
                     'optimizer': optimizer.state_dict()},
                    latest_filename=latest_checkpoint_path)

@tom @ptrblck Your opinion would be helpful!

Same issue here. But I suggest the PyTorch developers revise their optimizers based on TensorFlow’s. If I need to use TensorFlow to check the accuracy of PyTorch, why do I need PyTorch for my projects?


Hi, I had some suspects before, but I checked them and it worked. LSTM in TF initialises its weights with xavier_uniform by default, but PyTorch does not. I saw the TF model converge faster and to a better score than the same model in PyTorch. Then I found some posts on the internet and added this code to the __init__() of my model:

        def _weights_init(m):
            # Xavier-init the input-hidden and hidden-hidden weights of the
            # (bidirectional) LSTM, forward and reverse directions
            if isinstance(m, nn.LSTM):
                nn.init.xavier_normal_(m.weight_ih_l0)
                nn.init.xavier_normal_(m.weight_hh_l0)
                nn.init.xavier_normal_(m.weight_ih_l0_reverse)
                nn.init.xavier_normal_(m.weight_hh_l0_reverse)

        self.apply(_weights_init)

and it works like TensorFlow now. Btw, eps=1e-8 (vs the default TF eps=1e-7) also changes the score a bit. The game changer was initialising the LSTM weights with xavier_uniform.

@lukasz_borecki Did you use nn.init.xavier_normal_ or nn.init.xavier_uniform_ ?

You are comparing apples to oranges here. Does PyTorch achieve results similar to TF if the same learning rate decay is applied to both models?

@sushmit_roy Sorry for the late reply :frowning: there was no reminder in my email:

        def _weights_init(m):
            if isinstance(m, (nn.LSTM, nn.GRU)):
                nn.init.xavier_uniform_(m.weight_ih_l0)
                nn.init.orthogonal_(m.weight_hh_l0)
                nn.init.xavier_uniform_(m.weight_ih_l0_reverse)
                nn.init.orthogonal_(m.weight_hh_l0_reverse)
                nn.init.xavier_uniform_(m.weight_ih_l1)
                nn.init.orthogonal_(m.weight_hh_l1)
                nn.init.xavier_uniform_(m.weight_ih_l1_reverse)
                nn.init.orthogonal_(m.weight_hh_l1_reverse)

This was the code I used to init the weights of the biLSTM.

Greetings