I have had similar issues with PyTorch vs Keras. I haven’t found a simple answer, but these are other things I would check:
- Is Keras using any regularizers or constraints?
- Is Keras using biases whilst PyTorch is not?
- Are you computing the loss in exactly the same way (same inputs, logits vs probabilities, and same reduction, mean vs sum)?
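
On the loss point, one mismatch I am sure about: PyTorch's `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally, while Keras's `categorical_crossentropy` defaults to `from_logits=False` and expects probabilities. If a Keras model ending in a softmax layer is ported as-is, softmax may end up applied twice on the PyTorch side. Here is a minimal framework-free sketch of that effect. The helper names and the example numbers are made up for illustration and are not either library's API:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_probs(probs, target):
    # Roughly what a loss expecting probabilities computes for one sample.
    return -math.log(probs[target])

def cross_entropy_from_logits(logits, target):
    # Roughly what a loss expecting logits computes for one sample:
    # softmax first, then negative log-likelihood.
    return cross_entropy_from_probs(softmax(logits), target)

# Hypothetical model output for one sample with three classes.
logits = [2.0, 0.5, -1.0]
target = 0

# Intended loss: softmax applied exactly once.
correct = cross_entropy_from_logits(logits, target)

# Bug: the model already outputs softmax probabilities, and the
# logits-based loss applies softmax a second time.
double_softmax = cross_entropy_from_logits(softmax(logits), target)

print("softmax once: ", correct)
print("softmax twice:", double_softmax)

# Reduction is another place to look: mean and sum over the batch
# differ by a factor of the batch size, which also scales the gradients.
per_sample = [correct, double_softmax]
mean_loss = sum(per_sample) / len(per_sample)
sum_loss = sum(per_sample)
```

The two printed values differ, so the two models would report different losses and train differently even with identical weights. If that is the cause, you can either drop the final softmax in the PyTorch model or pass `from_logits=True` on the Keras side so that both losses see the same kind of input.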