Why can't I reimplement my TensorFlow model in PyTorch?

I am developing a model in TensorFlow and it performs well under my specific evaluation method. But when I port it to PyTorch, I can't achieve the same results. I have checked the model architecture, the weight-initialization method, the lr schedule, the weight decay, the momentum and epsilon used in the BN layers, the optimizer, and the data preprocessing. All of these are the same, but I can't get the same results as in TensorFlow. Has anybody met the same problem?

You may need to give specific details about the different components of your project (data, model, some code, etc.). Without that, it's difficult to pinpoint the problem.

Hi kk!

I built a simple (just for practice) network in tensorflow – two
fully-connected layers with biases. I successfully rewrote it in
pytorch and got statistically the same results.

I could not find matching built-in weight-initialization methods.
(The weights and/or biases differed one way or another.) So I
initialized my weights and biases by hand.
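
(To illustrate what I mean by initializing by hand – a rough
sketch where the sizes and names are only placeholders – I
generate the numbers once in numpy and copy them into both
frameworks:)

```python
import numpy as np
import torch

# generate the initial values once, outside of either framework
rng = np.random.RandomState(0)
w1 = rng.normal(0.0, 0.1, size=(20, 10)).astype(np.float32)  # (out, in)
b1 = np.zeros(20, dtype=np.float32)

# copy them into the pytorch layer ...
fc1 = torch.nn.Linear(10, 20)
with torch.no_grad():
    fc1.weight.copy_(torch.from_numpy(w1))
    fc1.bias.copy_(torch.from_numpy(b1))

# ... and load the same arrays (w1 transposed to (in, out)) into the
# corresponding tensorflow variables, so both nets start identically.
```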

Tensorflow was choosing my batches (stochastic gradient
descent), while in pytorch I was choosing my own. I didn’t
bother trying to make the batch selection match exactly.
So specific numbers generated by tensorflow and pytorch
differed randomly, but, as I said above, they were statistically
the same.

(In my case, it took me a little while to realize that the seemingly
same weight-initialization methods have slightly different recipes
for initializing the biases.)

Yours is a much more complicated network than my toy model,
but I do believe that you should be able to get them to give you
the same results.

Good luck.

K. Frank


Thank you, KFrank.
In my case, I have trained the model in TF repeatedly. I find that even if I remove color augmentation, the results remain good. But in PyTorch, the results are all bad.
There must be something I don't know about affecting the results.
Now I plan to try a small network and check the models again.
I am not the only one who has met this problem.

Here, I have dumped all the TensorFlow model weights into the PyTorch model.
I used a simple input as a test and found that the forward pass and the backward pass are both the same.
In my case, I use ResNet-110 as the backbone. The dataset is CIFAR-10.
The only difference is that I use the CIFAR-10 binary version in TF and the CIFAR-10 Python version in PyTorch.
Except for this, the two models use the same initial weights, the same optimizer (SGD with momentum 0.9), the same batch-normalization layers with momentum=0.99 and epsilon=1e-3, the same lr schedule, and weight decay=2e-4.
But the results in PyTorch are still worse than those in TF.
I really don't know why.
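
(For anyone reproducing this: note that the two frameworks define the BN momentum argument in opposite directions, so TF's momentum=0.99 corresponds to momentum=0.01 in PyTorch. A minimal sketch, with num_features just a placeholder:)

```python
import torch.nn as nn

# TF/Keras convention: momentum is the decay of the running statistics
#   moving_mean = 0.99 * moving_mean + 0.01 * batch_mean
# e.g. tf.keras.layers.BatchNormalization(momentum=0.99, epsilon=1e-3)

# PyTorch convention: momentum is the fraction taken from the current batch
#   running_mean = (1 - momentum) * running_mean + momentum * batch_mean
bn = nn.BatchNorm2d(num_features=16, momentum=0.01, eps=1e-3)
```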

Hi,

I found two differences that might be relevant. The first one is a different epsilon for the batch-norm layers, mentioned in your first post. The second one is how the padding works. PyTorch adds equal padding on all sides, whilst TensorFlow's "same" padding does not. This comes into effect when stride=2 in a resnet.
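
For reference, a rough sketch of what emulating TF's "same" padding can look like in PyTorch (the function and shapes below are only illustrative, not a drop-in for your resnet):

```python
import math
import torch
import torch.nn.functional as F

def conv2d_same(x, weight, stride):
    """Convolution with TF-style 'SAME' padding (square kernel assumed)."""
    k = weight.shape[-1]
    ih, iw = x.shape[-2:]
    oh, ow = math.ceil(ih / stride), math.ceil(iw / stride)
    pad_h = max((oh - 1) * stride + k - ih, 0)
    pad_w = max((ow - 1) * stride + k - iw, 0)
    # TF puts any odd leftover pixel on the bottom/right; a plain
    # nn.Conv2d(padding=1) would pad both sides equally instead.
    x = F.pad(x, [pad_w // 2, pad_w - pad_w // 2,
                  pad_h // 2, pad_h - pad_h // 2])
    return F.conv2d(x, weight, stride=stride)

x = torch.randn(1, 16, 32, 32)
w = torch.randn(32, 16, 3, 3)
y = conv2d_same(x, w, stride=2)  # (1, 32, 16, 16); pads 0 on top/left, 1 on bottom/right
```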

If you have compared a specific input in both of the models and seen that they produce the same output, I guess this is not the problem.

Let us know if you figure it out :slight_smile:


Thank you for your reply.
You are right, but I had found these points too.
The BN momentum and BN epsilon are different between the frameworks.
I also wrote my own padding, like the one in TF, before conv layers with stride=2.
I have tested the two models with the same weights and the same input tensor.
I followed each model layer by layer in the forward pass. All tensors are the same.
For the backward pass, I only checked one step of the update and found that the first conv layer after the update is still the same.
By now, I have also tried using the PyTorch dataset to feed data into the TensorFlow model (a weird thing). After training, the evaluation result is good. So the dataset is not the culprit.
Besides, I use SGD with momentum=0.9 and a weight-decay factor of 2e-4.
I really don't know what other hyperparameters I could set.

Are you putting the model in train() and eval() mode?

You can check that the outputs are the same for the two models after one weight update.
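
Something along these lines, purely as a sketch (the tiny model, loss and data here are stand-ins for your real setup; the TF side would mirror the same single step):

```python
import torch
import torch.nn.functional as F

# stand-in for the real network
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(16, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=2e-4)

x = torch.randn(4, 3, 32, 32)
target = torch.randint(0, 10, (4,))

model.train()
out_before = model(x)
loss = F.cross_entropy(out_before, target)
opt.zero_grad()
loss.backward()
opt.step()

model.eval()
with torch.no_grad():
    out_after = model(x)
# compare out_before, out_after and the updated weights against the TF run
```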

Yes, I set model.train() and model.eval().
I also checked the outputs; they are the same for the same input.
I also executed one step of the optimization method.
After the update, the first conv layer kernels are still the same.
I didn't execute more steps.


I have trained the TF model more than fifty times.
In all cases, the results are good enough.
I have also trained the PyTorch model nearly twenty times. None of the results are as good.
Up to now, the only thing I have not checked is the optimization procedure.
But I have checked the SGD code of PyTorch, and the weight decay is implemented slightly differently from TF.
In TF, the L2 loss is added to the total loss.
In PyTorch, the L2 penalty is applied directly to the gradient.
Theoretically, they are the same.
But I don't know whether these two implementations differ because of precision or approximation.
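
In code, the two formulations I mean look roughly like this (a toy sketch with a single parameter tensor and a made-up data loss):

```python
import torch

torch.manual_seed(0)
w = torch.randn(5, requires_grad=True)
x = torch.randn(5)
wd = 2e-4
data_loss = (w * x).sum()          # stand-in for the data loss

# TF-style: add the L2 penalty to the total loss, then differentiate
total_loss = data_loss + wd * 0.5 * (w ** 2).sum()
g_tf = torch.autograd.grad(total_loss, w, retain_graph=True)[0]

# PyTorch-style: differentiate the plain loss and add wd * w to the gradient
g_data = torch.autograd.grad(data_loss, w)[0]
g_pt = g_data + wd * w

print(torch.allclose(g_tf, g_pt))  # True, up to float rounding
```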

Oh yeah, I remember reading something about that. Couldn’t find a link though :confused:

Perhaps try to train for a bit without weight decay to see if that actually is the error?

Did you really read something about the difference between these two implementations?
In TF, the operation is 1/2 * sum(weight * weight). The grad is the same.
Since TF uses a static computation graph, the Jacobian is straightforward.
I will try to train the model without weight decay.
But in my model, the final evaluation result relies on the classification accuracy.
If the weight-decay implementation is the culprit, that is bad news. It would mean that I can't use PyTorch to train the model.

I don't remember exactly what I read, but I believe it stated that the PyTorch way of using weight decay was different from the original paper. I don't think it was in relation to TensorFlow.

You probably know much more than me on this subject :slight_smile:

blog post L2 != weight decay

Thank you.
I have read the blog.
I also checked the SGD implementation in PyTorch. The weight-decay term is added to the gradient and then accumulated in the momentum buffer. This is the same as the L2-regularization implementation in TF.
So, for the momentum optimization method in PyTorch, weight decay is the same as L2 regularization.
Personally, I think the only difference is that in PyTorch the weight-decay term in the grad is x directly. In TF, I have not found the source code of l2_loss(), but I think that, following the static graph, the grad of the L2 loss is computed as 0.5x + 0.5x. When the weight is small enough, 0.5x could underflow to 0 in float32, but such a small weight x barely affects the whole grad. So the optimization method may not be the culprit.
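
To spell out the update rule I am describing, a single SGD-with-momentum step as PyTorch performs it (no dampening or Nesterov) is roughly:

```python
def sgd_step(param, grad, buf, lr, momentum=0.9, weight_decay=2e-4):
    # the L2 term enters the gradient first ...
    grad = grad + weight_decay * param      # d_p = g + wd * w
    # ... and is then folded into the momentum buffer, exactly as an
    # explicit L2 term in the loss would be
    buf = momentum * buf + grad             # v = mu * v + d_p
    param = param - lr * buf                # w = w - lr * v
    return param, buf
```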

Hello, kk, did you find a resolution to your problem? I have met the same problem. Thanks for your reply.