Help with training performance

Hello, newbie here.
I am struggling with my new setup: AMD 9950X, ROG Crosshair X870E motherboard, 192 GB RAM, and 2 x Samsung 990 Pro in software RAID 1, running Ubuntu Server 24.04 LTS. I tried an RTX 4090 at first, then bought an RTX 5090, which I am using now.
Training with PyTorch runs at 1 to 6 it/s on the 5090. A few days ago I reached 9 it/s with the RTX 4090, but I don't know how.
Can you please help me find the bottleneck?

python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"

2.8.0.dev20250322+cu128
True
NVIDIA GeForce RTX 5090

python --version
Python 3.12.3

Thanks for any feedback you may have.
All the best

I would recommend profiling your code to narrow down where the performance bottleneck is.
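
As a first step, a rough sketch like the one below can show whether the time goes into waiting for data or into the training step itself. It is only an illustration: `time_training_loop`, `train_step`, and the `sync` argument are hypothetical names, and I am assuming your loop has the usual shape of "fetch a batch, then run forward/backward/optimizer". On a GPU you would pass `sync=torch.cuda.synchronize`, since CUDA kernels run asynchronously and the timings are misleading otherwise.

```python
import time


def time_training_loop(loader, train_step, num_batches=50, sync=None):
    """Split wall time into 'waiting for data' and 'running the step'.

    loader     -- any iterable yielding batches (e.g. a DataLoader)
    train_step -- hypothetical callable: forward, backward, optimizer step
    sync       -- optional callable that blocks until the GPU is idle,
                  e.g. torch.cuda.synchronize (assumed, not required)
    """
    data_time = 0.0
    step_time = 0.0
    n = 0
    it = iter(loader)
    while n < num_batches:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is input-pipeline time
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)
        if sync is not None:
            sync()  # wait for queued GPU work before reading the clock
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
        n += 1
    total = data_time + step_time
    return {
        "batches": n,
        "data_s": data_time,
        "step_s": step_time,
        "data_fraction": data_time / total if total > 0 else 0.0,
        "it_per_s": n / total if total > 0 else 0.0,
    }
```

If `data_fraction` is large, the GPU is probably waiting on the input pipeline (disk reads, CPU-side decoding and augmentation, DataLoader worker count), which would also explain why swapping the 4090 for a 5090 did not help. If it is small, the step itself is the slow part, and a finer-grained tool such as `torch.profiler` would be the next thing to try.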