Suboptimal convergence when compared with TensorFlow model

I have had similar issues with PyTorch vs. Keras. I haven’t found a simple answer, but here are a few other things I would check:

  • Is Keras using any regularizers or constraints?
  • Is Keras using biases whilst PyTorch is not?
  • Are you computing the loss in exactly the same way (same reduction, and the same kind of input: logits vs. probabilities)?
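
On the last point, one mismatch I would look for is what the loss expects as input. PyTorch's nn.CrossEntropyLoss takes raw logits and applies log-softmax internally, while Keras's categorical cross-entropy expects probabilities by default (from_logits=False). If a softmax layer gets carried over from the Keras model into the PyTorch one, the softmax is effectively applied twice and the gradients are flattened. Below is a rough, framework-free sketch of that effect using only the standard library. The function names and the numbers are made up for illustration and are not taken from either framework:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    # Hypothetical helper: loss that takes raw logits and applies
    # log-softmax internally (the "logits in" convention).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def cross_entropy_from_probs(probs, target):
    # Hypothetical helper: loss that takes probabilities
    # (the "probabilities in" convention).
    return -math.log(probs[target])

# Made-up example values: one sample, three classes, true class 0.
logits = [2.0, 0.5, -1.0]
target = 0

# Both conventions agree when each loss gets the input it expects.
loss_logits = cross_entropy_from_logits(logits, target)
loss_probs = cross_entropy_from_probs(softmax(logits), target)

# The bug: softmax output fed into a loss that applies softmax again.
loss_double = cross_entropy_from_logits(softmax(logits), target)

print(f"logits -> logits-loss: {loss_logits:.4f}")
print(f"probs  -> probs-loss : {loss_probs:.4f}")
print(f"probs  -> logits-loss: {loss_double:.4f}")
```

The first two values should match and the third should not. If your two models report different loss values for the same batch and the same weights, this kind of mismatch is one candidate to rule out before looking at optimizers or initialization.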