Keras training speed issue (PyTorch is a lot slower than TensorFlow)

This might not be a PyTorch issue, since I am using Keras, but I have no idea what the problem could be, and maybe someone can steer me in the right direction.

I am on native Windows, and I previously used old Keras with TensorFlow 2.10 (GPU-accelerated). I wanted to try Keras 3 with the PyTorch backend. Can someone please help me figure out why this model trains about 10x slower with Keras 3.4.1 and the PyTorch 2.3.1 backend? On my GPU, a single epoch takes a little over 2 minutes with TF and more than 20 minutes with PyTorch.

import os
os.environ["KERAS_BACKEND"] = "torch"  # must be set before keras is imported
import torch
torch.cuda.is_available()  # <-- returns True

import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras import optimizers
from keras.regularizers import l2

x_train, y_train = np.float32(x_train), np.float32(y_train)
x_val, y_val = np.float32(x_val), np.float32(y_val)

model = Sequential()
reg = 0.00001
model.add(LSTM(80, return_sequences=True, dropout=0.0, kernel_regularizer=l2(reg), recurrent_regularizer=l2(reg), input_shape=(x_train.shape[1], x_train.shape[2])))
model.add(LSTM(80, return_sequences=False, dropout=0.0, kernel_regularizer=l2(reg), recurrent_regularizer=l2(reg)))
model.add(Dense(40))
model.add(Dense(40))
model.add(Dense(1))
lrate = 0.001  # placeholder; the actual value doesn't matter for the speed comparison
opt = optimizers.Adam(learning_rate=lrate)
model.compile(optimizer=opt, loss='mean_squared_error')

from keras.callbacks import ModelCheckpoint, BackupAndRestore

basefolder = "."           # placeholder path
modelfile = "model.keras"  # placeholder filename
savecallback = ModelCheckpoint(basefolder + "/" + modelfile, save_best_only=False, monitor='val_loss', mode='min', verbose=1)
backupcallback = BackupAndRestore(basefolder + "/tmp/backup_" + modelfile)

batchsize = 32  # placeholder batch size
hist = model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=batchsize, epochs=20, callbacks=[savecallback, backupcallback])
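
For completeness, a quick sanity check confirms that the torch backend is actually active and the GPU is visible (keras.config.backend() is the Keras 3 way to query the backend):

print(keras.config.backend())         # should print "torch"
print(torch.cuda.get_device_name(0))  # should print the GPU name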

If anyone would like to test it, you can generate a small random sample like this:

x_train = np.random.uniform(-1, 1, size=(1976, 400, 14))
y_train = np.random.uniform(-1, 1, size=(1976, 1, 1))
x_val = np.random.uniform(-1, 1, size=(1137, 400, 14))
y_val = np.random.uniform(-1, 1, size=(1137, 1, 1))

With this small sample, one epoch takes ~30 seconds with PyTorch and 2-3 seconds with TF.
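
In case anyone wants to reproduce the numbers, simply wall-clocking a single epoch is enough; the batch size below is a placeholder:

import time

t0 = time.time()
model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=32, epochs=1)
print(f"epoch time: {time.time() - t0:.1f} s")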

I also experimented with jit_compile=True to get torch.compile (with its default "inductor" backend) going, but I was unable to get it working on Windows; as far as I can tell, inductor relies on Triton, which does not support native Windows.
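
For reference, this is what I was trying; jit_compile is a regular model.compile argument in Keras 3, and on the torch backend it routes through torch.compile:

model.compile(optimizer=opt, loss='mean_squared_error', jit_compile=True)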