Significant differences between Keras/TensorFlow and Torch

Hi all,

After several years of applying Deep Learning with Keras/TensorFlow, I recently tried to convert a rather simple image classification task from TensorFlow/Keras to PyTorch/Lightning. Basically, everything works, but Torch does not reach the same accuracy as Keras. After about two weeks of comparing and analyzing - mostly based on topics I found here - without resolving the issue, I decided to ask. Below I describe my setup step by step, trying not to miss anything important while skipping the trivial parts.

In Keras, the images are loaded and processed by a custom class derived from tf.keras.utils.Sequence. In Torch, I created an equivalent torch.utils.data.Dataset class that is passed to a torch.utils.data.DataLoader instance. In both cases the batches are loaded with the same code (wrapped in the respective class), so data loading and processing are identical. Correct loading of the images was additionally verified by displaying them after loading. Before being passed to the model, they are scaled to [0, 1] by a multiplication with 1/255 in either case.
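For reference, the Torch side of the data pipeline looks roughly like this (a sketch only; ImageDataset, load_image, paths and labels are placeholder names, not the actual code):

import torch
from torch.utils.data import Dataset, DataLoader

class ImageDataset(Dataset):
    # hypothetical counterpart of the Keras Sequence: returns one (image, label) pair
    def __init__(self, paths, labels):
        self.paths = paths
        self.labels = labels

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        image = load_image(self.paths[idx])                # placeholder for the shared loading code
        image = torch.from_numpy(image).float() / 255.0    # scale to [0, 1], as in the Keras pipeline
        image = image.permute(2, 0, 1)                     # HWC -> CHW, PyTorch conv layers expect channels first
        return image, self.labels[idx]

loader = DataLoader(ImageDataset(paths, labels), batch_size=32, shuffle=True)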

The model used is a simple CNN consisting only of Conv2D, ReLU, MaxPool and fully connected layers. The conversion is relatively straightforward:

tf.keras.layers.Conv2D(channels, kernel_size, activation=tf.nn.relu),
tf.keras.layers.MaxPool2D((2,2)),

translates to

nn.Conv2d(channels_in, channels_out, kernel_size),
nn.ReLU(),
nn.MaxPool2d((2,2))

and similarly, after flattening the output of the convolutional part,

tf.keras.layers.Dense(size, activation=tf.nn.relu)

translates to

nn.Linear(size_in, size_out),
nn.ReLU()

both used within a tf.keras.models.Sequential and an nn.Sequential model, respectively.
Verifying both models with model.summary() and torchinfo.summary() shows the same layer structure, i.e. the same output shapes and an identical number of parameters.
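For context, the assembled Torch model then has roughly this structure (the channel sizes, the flattened feature count and the number of classes below are placeholders for illustration, not the actual values):

import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, 3),             # Keras: Conv2D(32, 3, activation=tf.nn.relu)
    nn.ReLU(),
    nn.MaxPool2d((2, 2)),            # Keras: MaxPool2D((2, 2))
    nn.Flatten(),                    # Keras: Flatten()
    nn.Linear(32 * 15 * 15, 128),    # in_features must match the flattened conv output
    nn.ReLU(),
    nn.Linear(128, 4),               # raw logits, no softmax layer (see below)
)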

Here, I noticed two significant differences between Keras and Torch that were a bit tricky: in Keras, the final softmax classification layer is included in the model and in the loss computation, whereas in Torch, the loss computation expects raw (un-softmaxed) logits. Additionally, the layer weights and biases are initialized differently: Keras uses zeros for the biases and Xavier uniform for the weights; the Torch equivalents are

torch.nn.init.zeros_(layer.bias)
torch.nn.init.xavier_uniform_(layer.weight)

applied to all layers that contain parameters (i.e. convolution and linear).
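One way to apply this to every parameterized layer is via model.apply (a sketch of the idea, not necessarily the exact code I used):

import torch.nn as nn

def keras_like_init(layer):
    # mimic the Keras defaults: Glorot/Xavier uniform weights, zero biases
    if isinstance(layer, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(layer.weight)
        if layer.bias is not None:
            nn.init.zeros_(layer.bias)

model.apply(keras_like_init)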

In Keras, the model was compiled (model.compile()) using a default Adam optimizer and trained using model.fit().

To do the same in Torch, I’ve implemented a small lightning.LightningModule class. The optimizer is the same; Adam uses eps=1e-8 in Torch by default, so I changed this to 1e-7 as in Keras, and everything else is unchanged. In particular, the learning rate is the same in both settings (0.001), as is the batch size. Passing this configuration to a lightning.Trainer() yields pretty much the same training functionality as in Keras.
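The LightningModule boils down to roughly the following (a reduced sketch; the model is the nn.Sequential from above, and metric logging and validation are omitted here):

import torch
import torch.nn as nn
import lightning as L

class Classifier(L.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.loss_fn = nn.CrossEntropyLoss()   # applies log-softmax internally, expects raw logits

    def training_step(self, batch, batch_idx):
        images, targets = batch
        logits = self.model(images)            # no softmax layer in the model
        loss = self.loss_fn(logits, targets)
        self.log("loss", loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        # eps=1e-7 to match Keras' default; PyTorch's default is 1e-8
        return torch.optim.Adam(self.parameters(), lr=1e-3, eps=1e-7)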

The image data consists of ~100k images, randomly partitioned with stratification using sklearn’s train_test_split into 80% training / 20% validation data. In Keras, this yields the following losses and accuracies during the first ten epochs (i.e. running loss and accuracy within each epoch):

loss: 0.1162 - accuracy: 0.9580
loss: 0.0507 - accuracy: 0.9823
loss: 0.0402 - accuracy: 0.9858
loss: 0.0345 - accuracy: 0.9880
loss: 0.0304 - accuracy: 0.9891
loss: 0.0276 - accuracy: 0.9904
loss: 0.0251 - accuracy: 0.9909
loss: 0.0242 - accuracy: 0.9914
loss: 0.0229 - accuracy: 0.9917
loss: 0.0213 - accuracy: 0.9924

whereas Torch yields (step outputs removed):

[acc=0.934, loss=0.134]
[acc=0.975, loss=0.050]
[acc=0.981, loss=0.0386]
[acc=0.982, loss=0.0332]
[acc=0.984, loss=0.0299]
[acc=0.985, loss=0.0285]
[acc=0.986, loss=0.0262]
[acc=0.987, loss=0.024]
[acc=0.987, loss=0.0236]
[acc=0.988, loss=0.0225]

Obviously, the task is solved well in both cases, but Keras reaches about 0.5% higher accuracy. The same holds for the validation accuracy after 20-40 epochs: Keras reaches ~99.5%, Torch “only” ~99%. Of course, 99% accuracy is still fine, but 99.5% means half the error rate, so this is more than noise. Most importantly, these results are reproducible across repeated trainings: despite the randomization involved (train/test split, image shuffling within each epoch, etc.), the loss/accuracy results are relatively stable, and Keras keeps outperforming Torch by 0.5% validation accuracy (99.5% vs. 99.0%).

Furthermore, even using Torch’s default initialization (i.e. removing the explicit Xavier uniform and zeros initialization of weights and biases) does not change anything here. To me, it looks like something is significantly different during training between Keras and Torch. I read that Keras applies an internal learning rate decay within each epoch (cf. Keras learning rate schedules and decay - PyImageSearch) - not to be confused with a learning rate scheduler stepped after each epoch, which can easily be implemented in both frameworks - and I suspect that this is one of the most significant differences. Or does Torch perform the same learning rate updates within each epoch? If not, how can this ideally be implemented in Torch? Or are there any other significant differences one should be aware of?

Appreciating any comment and thanks in advance :slight_smile: :+1:

No, PyTorch itself won’t apply learning rate updates behind your back. I’m unsure whether Lightning has such functionality.

I would recommend trying to load the Keras parameters into the PyTorch model directly (instead of trying to initialize both models in the same way), which would allow you to compare the outputs for the same samples and narrow down where the difference comes from.
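For the transfer itself, something along these lines should work (a sketch; Keras stores Conv2D kernels as (kh, kw, in, out) and Dense kernels as (in, out), so they need to be permuted/transposed, and the first Linear after the Flatten would additionally need its input features reordered because TF flattens in NHWC order while PyTorch flattens in NCHW order):

import torch

def copy_conv(keras_layer, torch_layer):
    kernel, bias = keras_layer.get_weights()
    # Keras kernel (kh, kw, in, out) -> PyTorch weight (out, in, kh, kw)
    torch_layer.weight.data = torch.from_numpy(kernel).permute(3, 2, 0, 1).contiguous()
    torch_layer.bias.data = torch.from_numpy(bias)

def copy_dense(keras_layer, torch_layer):
    kernel, bias = keras_layer.get_weights()
    # Keras kernel (in, out) -> PyTorch weight (out, in)
    torch_layer.weight.data = torch.from_numpy(kernel).t().contiguous()
    torch_layer.bias.data = torch.from_numpy(bias)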

Thanks for your comment! Indeed, using Keras’ weights in Torch is a good idea. I tested this, but only the first epoch’s training accuracy increases to the same level as in Keras. Specifically, Keras yields about 95-96% running training accuracy during the first epoch, whereas Torch only reaches about 93.0-93.5%. With the copied Keras initialization weights, Torch also yields 95% training accuracy after the first epoch. However, the accuracies in the remaining epochs are comparable to those with Torch’s own initialization. Thus, Keras seems to initialize the weights better than Torch (even though they should be equivalent…?), but this only boosts the first epoch’s accuracy, while the convergence afterwards stays pretty much the same. As before, Torch reaches ~99% validation accuracy after 20 epochs.

This brings me back to my previous guess, especially since Torch does not apply such a decay by itself: I need to implement a learning rate decay within each epoch, i.e. somewhere in my LightningModule. What is the right or most elegant way to implement this?

I’m unsure what the main difference between this approach and a standard learning rate scheduler would be, as you could just call scheduler.step() in each iteration instead of once per epoch.
Assuming you can implement the decay formula in a custom scheduler (or use an available one), I wouldn’t expect any special treatment to be necessary.
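In Lightning this could e.g. be wired up in configure_optimizers by returning the scheduler with interval='step', so that it is stepped after every batch. A sketch mimicking a Keras-style lr = lr0 / (1 + decay * iterations) decay (the decay value here is made up):

def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3, eps=1e-7)
    decay = 1e-4  # hypothetical decay factor, just for illustration
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: 1.0 / (1.0 + decay * step)
    )
    return {
        "optimizer": optimizer,
        "lr_scheduler": {"scheduler": scheduler, "interval": "step"},  # step() every iteration
    }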

Thanks for your comment! I tried to implement this in Torch, and while inspecting the decay parameters Keras uses, I noticed that they are zero. Thus, Keras should not change the learning rate within the epoch either, unless this is explicitly enabled by the user. So there has to be a different reason for my issue.

Going one step back to your previous suggestion, I will try to initialize the models in both frameworks identically and eliminate all randomization during training (no shuffling, a fixed train/test split, no data augmentation). That way, the results should be almost identical - same data, same model, same initialization, same optimizer, no randomization - or is there any internal randomization that needs to be controlled as well? I hope I can find the difference this way.
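To pin down the obvious sources of randomness, I plan to use something like this (a sketch; paths and labels are placeholders):

import lightning.pytorch as pl
from sklearn.model_selection import train_test_split

pl.seed_everything(42, workers=True)  # seeds Python, NumPy and torch RNGs (also in DataLoader workers)

# fixed, stratified split instead of a new random one per run
train_paths, val_paths, train_labels, val_labels = train_test_split(
    paths, labels, test_size=0.2, stratify=labels, random_state=42
)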

After further analysis, I found a problem with the accuracy computation in Torch/Lightning. For the moment I don’t know whether this completely resolves my issue, but it yields the following results:

import torch
from torchmetrics.classification import MulticlassAccuracy
from torchmetrics import Accuracy

acc1 = MulticlassAccuracy(num_classes=4)
acc2 = Accuracy(task='multiclass', num_classes=4)
preds = torch.tensor([[1., 0., 0., 0.], [1., 0., 0., 0.]])  # per-class scores for two samples
targets = torch.tensor([0, 0])                               # integer class labels

acc1(preds, targets) # 0.25 -> ???
acc2(preds, targets) # 1    -> correct

For some reason, MulticlassAccuracy computes something unexpected here. I don’t want to claim that this is a bug in torchmetrics - maybe the class has to be initialized differently - but it is at least an easy pitfall.

Due to this issue, my previously computed accuracies were incorrect. After using Accuracy instead, the accuracies are significantly improved. Further tests to follow…

Great finding!
By default it seems MulticlassAccuracy will use the average='macro' setting, which will:

  • macro: Calculate statistics for each label and average them

according to the docs.
Also take a look at their example which shows:

from torchmetrics.classification import MulticlassAccuracy
target = torch.tensor([2, 1, 0, 0])
preds = torch.tensor([2, 1, 0, 1])
metric = MulticlassAccuracy(num_classes=3)
metric(preds, target)

mca = MulticlassAccuracy(num_classes=3, average=None)
mca(preds, target)

In this example 3 classes are used where each prediction indicates the class label.
You can also use a floating point tensor representing logits or probabilities. Internally, torch.argmax will then be used to derive the predicted class labels.
Based on your code I assume your preds tensor indicates if a class is “active” or not.
If you pass it as a floating point tensor into the unreduced metric you would get:

acc1 = MulticlassAccuracy(num_classes=4, average=None)
preds = torch.tensor([[1., 0, 0, 0], [1, 0, 0, 0]])
targets = torch.tensor([0, 0])

acc1(preds, targets)
#  tensor([1., 0., 0., 0.])

which corresponds to a perfect accuracy for class 0 and a zero accuracy for all other classes, as these are missing from the targets.
The macro average will then average the accuracies of all classes yielding 0.25.
If you use micro:

  • micro: Sum statistics over all labels

the result should match your other approach and will return tensor(1.).
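With the example from above that would be (assuming a recent torchmetrics version):

import torch
from torchmetrics.classification import MulticlassAccuracy

preds = torch.tensor([[1., 0., 0., 0.], [1., 0., 0., 0.]])
targets = torch.tensor([0, 0])

MulticlassAccuracy(num_classes=4, average="macro")(preds, targets)  # tensor(0.2500), the default
MulticlassAccuracy(num_classes=4, average="micro")(preds, targets)  # tensor(1.)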

At least that’s how I understand the docs.

Indeed, you are right. I checked the documentation, too, but so far I don’t fully understand the difference between average='micro' and average='weighted' (presumably the latter weights the per-class statistics by class support); in any case, both alternatives seem to compute the correct overall accuracy here - in contrast to the default 'macro'. However, it is unclear to me why 'macro' is the default, in particular since Accuracy(task='multiclass') appears to be equivalent but actually computes something different (in fact, the correct value). In my opinion, the default of MulticlassAccuracy should be changed, and I find it rather surprising that nobody else has run into this issue before (or maybe people did without really noticing?). What do you think?

With regard to my issue, the results of Torch and Keras are reasonably close to each other using the correct accuracy computation, as expected. I have not finished all my tests yet, but I guess my issue is mostly resolved. It seems as if there never really was a real difference; my accuracy statistics just showed me something wrong.

I don’t know how these default arguments were selected, but note that both metrics are “correct” from their respective perspective. Indeed it might be confusing and you could certainly ask the library authors why the macro average was selected as the default (maybe in their GitHub repository).

Yes, agreed - as I wrote before, I don’t want to claim it’s a bug, but it is a very easy pitfall. Apart from that, all my experiments so far have confirmed comparable results between Torch and Keras, i.e. my issue was most likely caused by the incorrect accuracy computation. In summary, I think these are the most important aspects when converting Keras code to Torch code, in case someone else runs into similar issues:

  1. Do not add a softmax prediction layer; it is already included in the cross-entropy loss function (see the small check after this list).
  2. Use Accuracy(task='multiclass') instead of MulticlassAccuracy(), or at least verify that the latter computes the intended accuracy, i.e. use the right 'average' option.
  3. Layer initialization differs: Keras uses Xavier (Glorot) uniform for the weights and zeros for the biases, whereas Torch uses Kaiming uniform by default.
  4. Default parameters sometimes differ slightly (e.g. the Adam optimizer’s eps parameter: 1e-7 in Keras vs. 1e-8 in Torch).
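Regarding point 1, a quick sanity check of that equivalence (just a sketch, not the original training code):

import torch
import torch.nn.functional as F

logits = torch.randn(8, 4)               # raw model outputs, no softmax applied
targets = torch.randint(0, 4, (8,))

# cross entropy = log-softmax + negative log-likelihood, so no extra softmax layer is needed
loss_a = F.cross_entropy(logits, targets)
loss_b = F.nll_loss(F.log_softmax(logits, dim=1), targets)
assert torch.allclose(loss_a, loss_b)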

Besides this, converting TensorFlow/Keras to PyTorch/Lightning is relatively straightforward. So thanks for your support :+1: