If you implement the Adam optimizer in raw NumPy and use TF 2.x and PyTorch only to compute the network outputs, loss, and gradients (with the same hyperparameters, the same Xavier uniform initialization for the weights, and zero initialization for the biases), you still get slightly better performance (faster convergence) with TF 2.x than with PyTorch. So I think something else is going on beyond a problem with the optimizer, the weight initialization, or differences in hyperparameters.
I converted a TensorFlow model into a PyTorch model, fixed the weights and the input batch, and I get the same loss, but for this same loss I get different gradients. I created a new topic here with short reproducible code.
Could you please take a look?
One quick question: is it not guaranteed that for the same loss, the gradient will be the same? If not, what are the other things that I should check?
If the loss were (e.g. continuously) differentiable and any input were valid, you would expect to have the same gradient up to numerical precision.
In practice, the loss functions used in deep learning have non-differentiable points, and libraries deal with them in different ways, e.g. the point 0 for ReLU.
Fancy people will say that there, any subgradient can be defended. One case I remember is that cuDNN uses gradient 1 at 0 in its RNN+ReLU implementation, while you would usually not do this, and in most other places libraries set the gradient to 0 at that point. Quite likely this wasn't as much the standard when the function was implemented in cuDNN. As things are then passed through many layers, the gradients may differ.
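As a minimal sketch, you can check which subgradient a framework picks at the ReLU kink; PyTorch's autograd returns 0 there:
import torch

# gradient of ReLU exactly at the non-differentiable point 0
x = torch.tensor([0.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)  # tensor([0.])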
Best regards
Thomas
Thanks for all the useful comments above. I took the initiative to bump this issue after spending a couple of days on attempts to reproduce CycleGAN-VC, which seems not to be reproducible due to Adam optimizer/sparsity differences, according to this issue:
- GitHub issue here
- related PyTorch forum discussion
- the whole GitHub repo, which however claims PyTorch's Adam to be better
I'm very unfamiliar with TF, but I will do my best to get a minimal reproducible example with the two frameworks.
I spent a bit of time on it, but I struggled to implement two identical matrix multiplications in TF and PyTorch: even though I'm loading the weights initialised by PyTorch into TF, with the same input, the mean of the feedforward output seems to be different (0.003 vs -0.006).
To avoid being the guy who just says "ok solved, it was a stupid bug", I want to note here that in my case the main problem was not the difference in the Adam optimiser, but the fact that I used a PixelShuffler layer ported from TF code. It turns out tf.reshape is not identical in behaviour to torch.reshape, that's what I concluded.
I ran some experiments as mentioned above, but I couldn't isolate the problem. Initialising with the same weights and running 1000 epochs seems to lead to different results in the two frameworks for me (even with SGD). Unfortunately I don't have this code anymore and I'm a bit busy at the moment, but I encourage people to try this experiment with different optimisers. There is a behaviour difference; whether it's significant for a particular application is a different question.
If someone finds otherwise (i.e. the same matrix multiplication initialised to the same weights (not only the same scheme), the same input/output, MSE loss, Adam, a different framework, and the same results), then I'm willing to dig up my old code and help, but I don't really see an incentive to address this (sadly).
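For anyone who wants to try, here is a minimal sketch of that experiment (the shapes, seed and hyperparameters are made up): one weight matrix shared between frameworks, a fixed input/target, MSE loss, and Adam in both TF 2.x and PyTorch.
import numpy as np
import torch
import tensorflow as tf

np.random.seed(0)
W0 = np.random.randn(4, 3).astype(np.float32)  # shared initial weight
x = np.random.randn(8, 4).astype(np.float32)   # fixed input batch
y = np.random.randn(8, 3).astype(np.float32)   # fixed target

# --- PyTorch ---
lin = torch.nn.Linear(4, 3, bias=False)
with torch.no_grad():
    lin.weight.copy_(torch.from_numpy(W0.T.copy()))  # nn.Linear stores weight as (out, in)
opt_pt = torch.optim.Adam(lin.parameters(), lr=1e-3, eps=1e-7)  # match the eps explicitly
for _ in range(1000):
    opt_pt.zero_grad()
    loss_pt = torch.nn.functional.mse_loss(lin(torch.from_numpy(x)), torch.from_numpy(y))
    loss_pt.backward()
    opt_pt.step()

# --- TensorFlow 2.x ---
W_tf = tf.Variable(W0)
opt_tf = tf.keras.optimizers.Adam(learning_rate=1e-3, epsilon=1e-7)
for _ in range(1000):
    with tf.GradientTape() as tape:
        loss_tf = tf.reduce_mean(tf.square(tf.matmul(x, W_tf) - y))
    grads = tape.gradient(loss_tf, [W_tf])
    opt_tf.apply_gradients(zip(grads, [W_tf]))

# should agree closely, though an exact match is not guaranteed because the
# epsilon handling inside the two Adam implementations still differs slightly
print(float(loss_pt), float(loss_tf))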
I had spent days trying to solve this problem. Finally, I was able to align my loss values (exactly the same). Follow these steps:
1. Set random seeds (NumPy, TF and PyTorch) to ensure consistent results while running your models (see the sketch below the list).
2. Start with a small data sample (say 5 training and test data points)
3. Turn off the shuffle. Note, some iterators by default have shuffle enabled, explicitly turn them off.
4. Set the same exact hyperparameters for TF and Pytorch.
5. Make sure exactly the same preprocessing is applied for both the cases.
6. Set batch_size = 1 and number of epochs = 1. Make sure the same data is being fed (to both PyTorch and TF) in the same order.
7. The number of model parameters should be the same.
8. Turn off dropout. Dropout randomly switches off neurons in the model, remove it.
9. Use SGD (for PyTorch) and GradientDescentOptimizer (for TF) as the optimizers. NOTE: the Adam optimizer is implemented slightly differently in TF and PyTorch and can cause a difference in losses.
The losses should then be aligned (exactly the same).
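A minimal sketch of step 1 (the seed value is arbitrary, and the TF call depends on the version):
import random
import numpy as np
import tensorflow as tf
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)        # TF 2.x; in TF 1.x use tf.set_random_seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True   # make cuDNN pick deterministic kernels
torch.backends.cudnn.benchmark = False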
Thank you everyone for the helpful ideas. The first bug I found was that, given the different axis order in convolutional networks (NHWC in TF vs. NCHW in PyTorch), the reshape function will produce different results. For example
[[1, 2, 3], [4, 5, 6]]
would become
[1, 2, 3, 4, 5, 6]
versus
[1, 4, 2, 5, 3, 6].
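To make this concrete, a small NumPy illustration of the two flattening orders (a plain row-major reshape vs. reshaping after the axes have been permuted, as happens when porting NHWC code to NCHW):
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
print(a.reshape(-1))    # [1 2 3 4 5 6]   row-major flatten (what tf.reshape / torch.reshape do)
print(a.T.reshape(-1))  # [1 4 2 5 3 6]   flatten after the axes were swapped first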
Even after fixing this, though, the PyTorch model does not converge as well as the TF1 model. The DNN has multiple convolutional and transposed convolutional layers. I did the following:
1. Save TF1 weights to file and load them into PyTorch, so the parameters are identical
2. Use the same loss, Adam optimizer hyperparameters
3. Pre-process a single data sample and save it to file
4. Load the sample into program, compute forward pass and backward pass, save data and gradient at each layer
5. Find that the data and gradients look identical in plots, and that the differences range from 1e-6 to 1e-10 (i.e. they are identical within the bounds of numerical precision); a sketch of this comparison follows below
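For reference, a sketch of how steps 4 and 5 can be automated on the PyTorch side with hooks; the layer filter and the NHWC-to-NCHW transpose applied to the TF arrays are assumptions:
import numpy as np
import torch

acts, grads = {}, {}

def register_debug_hooks(model):
    # record per-layer activations and output gradients under the module name
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.ConvTranspose2d)):
            module.register_forward_hook(
                lambda m, inp, out, name=name: acts.update({name: out.detach().cpu().numpy()}))
            module.register_full_backward_hook(
                lambda m, gin, gout, name=name: grads.update({name: gout[0].detach().cpu().numpy()}))

def compare(name, torch_arr, tf_arr):
    # TF1 conv tensors are NHWC, PyTorch's are NCHW, so transpose before comparing
    tf_arr = np.transpose(tf_arr, (0, 3, 1, 2))
    diff = np.abs(torch_arr - tf_arr).max()
    print(f"{name}: max abs diff = {diff:.2e}")  # expect roughly 1e-6 to 1e-10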
As far as I know, the two dataloaders are identical. Just to be sure, I'll try training both models for a few epochs on a single data sample to check whether they converge the same way.
Any thoughts on what to try next would be helpful! Did most people on this thread give up, or fix their problem?
Hi everyone, I checked the following:
- Loaded identical weights from file to initialize PyTorch and TF models. Loaded the same data sample from numpy into both programs. Ran a single forward-backward pass with the Adam optimizer. The intermediary data layers and gradients were within 1e-6 to 1e-10 of each other. Plots looked identical (after transposition).
- Loaded identical weights from file to initialize PyTorch and TF models. Trained and tested, always loading the same data sample. Used 1000 iterations for training and 1 for testing, with batch size 1. Used SGD instead of Adam. Losses were identical.
# PyTorch
Finished debug testing - MSE: 0.1504615843296051
Finished debug testing - MSE: 0.10858417302370071
Finished debug testing - MSE: 0.08603279292583466
# TensorFlow
Finished debug testing - Mean MSE: 0.15046157
Finished debug testing - Mean MSE: 0.108584
Finished debug testing - Mean MSE: 0.08603277
- Did everything exactly the same as in check 2, except switching to the Adam optimizer. PyTorch performs worse.
# PyTorch:
Finished debug testing - MSE: 0.0031117501202970743
Finished debug testing - MSE: 0.0020642257295548916
Finished debug testing - MSE: 0.0019268309697508812
Finished debug testing - MSE: 0.0016333406092599034
Finished debug testing - MSE: 0.0017334128497168422
Finished debug testing - MSE: 0.0014430736191570759
Finished debug testing - MSE: 0.0010424457723274827
Finished debug testing - MSE: 0.0012145100627094507
Finished debug testing - MSE: 0.0011195113183930516
Finished debug testing - MSE: 0.0009501167223788798
Finished debug testing - MSE: 0.0009987876983359456
Finished debug testing - MSE: 0.0007953296881169081
Finished debug testing - MSE: 0.00075263757025823
Finished debug testing - MSE: 0.0008374055614694953
Finished debug testing - MSE: 0.000735406531020999
# TensorFlow:
Finished debug testing - Mean MSE: 0.0036667113
Finished debug testing - Mean MSE: 0.0032563617
Finished debug testing - Mean MSE: 0.0021536187
Finished debug testing - Mean MSE: 0.0015266595
Finished debug testing - Mean MSE: 0.0013580231
Finished debug testing - Mean MSE: 0.0013878695
Finished debug testing - Mean MSE: 0.0011856346
Finished debug testing - Mean MSE: 0.0011136091
Finished debug testing - Mean MSE: 0.00091276
Finished debug testing - Mean MSE: 0.000890126
Finished debug testing - Mean MSE: 0.00088381825
Finished debug testing - Mean MSE: 0.0007283067
Finished debug testing - Mean MSE: 0.00081382995
Finished debug testing - Mean MSE: 0.0006670901
Finished debug testing - Mean MSE: 0.00046282331
Details:
TF 1.15.3.
adam_optimizer = tf.train.AdamOptimizer(learning_rate=5e-5)
# default parameters from the documentation at https://github.com/tensorflow/tensorflow/blob/v1.15.0/tensorflow/python/training/adam.py#L32-L235:
# learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, use_locking=False, name="Adam")
PyTorch 1.8.1.
torch.optim.Adam(params=model.parameters(), lr=5e-5, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0)
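Since the difference only shows up with Adam, it may help to write out both update rules: as documented, tf.train.AdamOptimizer folds the bias correction into the learning rate, so epsilon enters the denominator in a slightly different place than in torch.optim.Adam. A small NumPy sketch of a single step (the values below are made up):
import numpy as np

def adam_step_tf1(p, m, v, g, t, lr=5e-5, b1=0.9, b2=0.999, eps=1e-8):
    # tf.train.AdamOptimizer: bias correction folded into lr_t, eps added to sqrt(v)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    lr_t = lr * np.sqrt(1 - b2 ** t) / (1 - b1 ** t)
    return p - lr_t * m / (np.sqrt(v) + eps), m, v

def adam_step_torch(p, m, v, g, t, lr=5e-5, b1=0.9, b2=0.999, eps=1e-8):
    # torch.optim.Adam: m and v bias-corrected explicitly, eps added to sqrt(v_hat)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return p - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# The two coincide only for eps == 0; otherwise TF's effective epsilon is
# eps / sqrt(1 - b2**t), i.e. larger than PyTorch's eps early in training.
print(adam_step_tf1(1.0, 0.0, 0.0, 0.1, t=1)[0],
      adam_step_torch(1.0, 0.0, 0.0, 0.1, t=1)[0])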
Model:
This model is fully convolutional and contains TransposedConv2D layers.
def train(...):
    ...
    checkpoint = torch.load(checkpoint_file, map_location=device)
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    ...
    counter = 0
    while run:
        counter += 1
        if counter > 1000:
            break

        # "in" is a reserved word in Python, so the input is called x here
        x = np.load("debug_data/in.npy")
        out1 = np.load("debug_data/out1.npy")
        out2 = np.load("debug_data/out2.npy")

        # adjust layout from TF
        x = x.squeeze(3)
        x = np.expand_dims(x, axis=0)
        # ... do the same for out1 and out2

        x, out1, out2 = \
            torch.from_numpy(x).to(device), \
            torch.from_numpy(out1).to(device), \
            torch.from_numpy(out2).to(device)

        optimizer.zero_grad()
        out1_hat, out2_hat = model(x)
        train_loss = loss_fn(out1_hat, out1) + loss_fn(out2_hat, out2)
        train_loss.backward()
        optimizer.step()

        save_checkpoint({'state_dict': model.state_dict(),
                         'optimizer': optimizer.state_dict()},
                        latest_filename=latest_checkpoint_path)
Same issue here. But I suggest the PyTorch developers revise their optimizers based on TensorFlow's. If I need to use TensorFlow to check the accuracy of PyTorch, why do I need PyTorch for my projects?
Hi, I had some suspects before, and after checking them it worked. LSTM in TF initializes its weights with xavier_uniform, but PyTorch does not. I saw the TF model converge faster and reach a better score than the same model in PyTorch. Then I found some posts on the internet and added this code to my model's init():
def _weights_init(m):
    if isinstance(m, nn.LSTM):
        nn.init.xavier_normal_(m.weight_ih_l0)
        nn.init.xavier_normal_(m.weight_hh_l0)
        nn.init.xavier_normal_(m.weight_ih_l0_reverse)
        nn.init.xavier_normal_(m.weight_hh_l0_reverse)

self.apply(_weights_init)
and it works like TensorFlow now. By the way, eps=1e-8 in PyTorch vs. the TF default eps=1e-7 also changes the score a bit. The game changer was initializing the LSTM weights with xavier_uniform.
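For reference, a minimal sketch (the sizes are made up) of what the snippet above replaces: PyTorch's nn.LSTM default initializes every weight and bias from a uniform distribution with bound 1/sqrt(hidden_size):
import math
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, bidirectional=True)
bound = 1.0 / math.sqrt(lstm.hidden_size)
w = lstm.weight_ih_l0
print(w.min().item() >= -bound, w.max().item() <= bound)  # True True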
@lukasz_borecki Did you use nn.init.xavier_normal_ or nn.init.xavier_uniform_ ?
You are comparing apples to oranges here. Does PyTorch achieve results similar to TF if the same learning rate decay is applied to both models?
@sushmit_roy Sorry for the late reply, there was no reminder in my email:
def _weights_init(m):
    if isinstance(m, (nn.LSTM, nn.GRU)):
        nn.init.xavier_uniform_(m.weight_ih_l0)
        nn.init.orthogonal_(m.weight_hh_l0)
        nn.init.xavier_uniform_(m.weight_ih_l0_reverse)
        nn.init.orthogonal_(m.weight_hh_l0_reverse)
        nn.init.xavier_uniform_(m.weight_ih_l1)
        nn.init.orthogonal_(m.weight_hh_l1)
        nn.init.xavier_uniform_(m.weight_ih_l1_reverse)
        nn.init.orthogonal_(m.weight_hh_l1_reverse)
This was the code I used to init the weights of the biLSTM.
Greetings