GPU training speed is slower / same as my CPU

I know this has probably been posted many times. I have tried some of the suggested solutions, but I am fairly new to this and not sure where things went wrong.
In short: training a 4-layer torch model on a 20k x 20 dataset takes about 120 min on my CPU and about 130 min on my GPU.

I have followed the steps as listed in

and also used
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

to install PyTorch. At first I downloaded CUDA 12.3 from NVIDIA; I have since removed it (both through the system's add/remove programs and with conda uninstall cuda) and installed 12.1 instead. But I am not sure why I get the values below when I print them (why 12.3 instead of 12.1?). I can confirm that torch.cuda.is_available() returns True.

import sys
import torch

print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION')

__Python VERSION: 3.11.8 | packaged by Anaconda, Inc. | (main, Feb 26 2024, 21:34:05) [MSC v.1916 64 bit (AMD64)]
__pyTorch VERSION: 2.2.1
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:30:42_Pacific_Standard_Time_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
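For comparison, here is a minimal sketch of reading the CUDA version that the installed PyTorch build itself reports, rather than the one printed by nvcc. As far as I understand, nvcc describes the system-wide toolkit, which can differ from the CUDA runtime bundled with the conda package, and the "CUDA Version" in nvidia-smi is the highest version the driver supports. The helper name describe_environment is hypothetical, and the import guard is only there so the snippet also runs where torch is not installed.

```python
import importlib.util
import sys


def describe_environment():
    """Collect version info; torch fields stay None if torch is not installed."""
    info = {
        "python": sys.version.split()[0],
        "torch": None,
        "torch_cuda_build": None,
        "cuda_available": None,
    }
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = torch.__version__
        # CUDA version this PyTorch binary was built against
        # (None for a CPU-only build); independent of `nvcc --version`.
        info["torch_cuda_build"] = torch.version.cuda
        info["cuda_available"] = torch.cuda.is_available()
    return info


print(describe_environment())
```

If torch_cuda_build shows 12.1 while nvcc shows 12.3, the two numbers are probably describing different installations rather than a broken setup.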

I have also moved the data and the model to CUDA, as in the example below:
X_train_tensor_gpu = X_train_tensor.to('cuda')
model = Net().to('cuda')
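To check that the model and data really sit on the same device, and to time a single training step on its own, something like the sketch below might help. It is not my actual training code: the small network, the batch size of 256, the layer width of 64, and the helper name time_steps are all assumptions made for illustration (only the 20 input features come from my dataset). CUDA kernels run asynchronously, so the sketch calls torch.cuda.synchronize() before reading the clock.

```python
import importlib.util
import time


def time_steps(step_fn, sync_fn=None, warmup=2, iters=10):
    """Return mean seconds per call of step_fn.

    sync_fn (e.g. torch.cuda.synchronize) is called before each clock read
    so that queued GPU work is finished when the time is taken.
    """
    for _ in range(warmup):
        step_fn()
    if sync_fn is not None:
        sync_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    if sync_fn is not None:
        sync_fn()
    return (time.perf_counter() - start) / iters


if importlib.util.find_spec("torch") is not None:
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Stand-in network (assumption): 20 input features, hidden width 64.
    model = torch.nn.Sequential(
        torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
    ).to(device)
    x = torch.randn(256, 20, device=device)  # hypothetical batch
    y = torch.randn(256, 1, device=device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    def step():
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    sync = torch.cuda.synchronize if device == "cuda" else None
    print("parameters on:", next(model.parameters()).device, "| data on:", x.device)
    print("mean seconds per step on", device, ":", time_steps(step, sync))
```

Running the same script once on the CPU and once on the GPU gives a per-step comparison that does not depend on the rest of the training loop, such as data loading or logging.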

nvidia-smi returns the following:
Wed Mar 6 14:59:19 2024
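+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.61                 Driver Version: 551.61         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 …    WDDM  |   00000000:01:00.0  On |                  N/A |
|  0%   40C    P8             20W /  285W |    1746MiB /  16376MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+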

I am not sure whether this is the right way to monitor GPU usage, but while the process runs, the Performance tab in Task Manager shows GPU utilisation going up to 94%, so I think it is using my GPU. It is a 4070 Ti Super, so I don't see why it would be slower than my CPU.

Can anyone help? I am not sure what other information is needed.