Performance gap between torch 2.2.2 and 1.12.0

Hi PyTorch team,
I am training models on 4 NVIDIA RTX 3090 GPUs. When I upgraded PyTorch from 1.12.0 to 2.2.2, the memory usage on each GPU increased by approximately 30% (from 16 GB to 21 GB), and training also slowed down (from 1.5 iters/s to 1.2 s/iter).
What should I look at to resolve these two issues?

Profile the code to narrow down which parts are causing the slowdown.
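
As a starting point, here is a minimal sketch of what that could look like with `torch.profiler`. The tiny model, the random batches, and the `profile_training` helper are placeholders I made up for illustration, not your actual training code; swap in your own model and data loader. The idea is to run the same script under both 1.12.0 and 2.2.2 and compare the per-operator tables.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity


def profile_training(model, batches, optimizer, loss_fn, device):
    """Run a few training steps under the profiler and return it.

    Hypothetical helper for illustration; adapt the loop body to your own code.
    """
    activities = [ProfilerActivity.CPU]
    if device.type == "cuda":
        activities.append(ProfilerActivity.CUDA)

    # profile_memory=True also records per-operator memory usage,
    # which helps with the memory increase as well as the slowdown.
    with profile(activities=activities, profile_memory=True, record_shapes=True) as prof:
        for inputs, targets in batches:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
    return prof


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model and data (assumptions, not the original workload).
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [(torch.randn(32, 64), torch.randint(0, 10, (32,))) for _ in range(5)]

prof = profile_training(model, batches, optimizer, nn.CrossEntropyLoss(), device)

# Both sort keys are accepted by 1.12.0 and 2.2.2; later releases may rename them.
sort_key = "cuda_time_total" if device.type == "cuda" else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=15))

if device.type == "cuda":
    peak_gib = torch.cuda.max_memory_allocated(device) / 2**30
    print(f"peak allocated memory: {peak_gib:.2f} GiB")
```

Operators whose time or memory differs noticeably between the two versions are the ones worth looking at first. Profiling only a handful of iterations is usually enough and keeps the overhead manageable.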