Performance issue with PyTorch v1.12.0

Hello,

I recently built PyTorch from source with CUDA 11.7.0 and ran a BERT training job.
With torch v1.11.0, training reaches 450 examples/sec, but with torch v1.12.0 it only reaches 150 examples/sec, about three times slower.

Below is the installation command:
CFLAGS="-g0 -fno-gnu-unique" USE_CUPTI_SO=1 USE_KINETO=1 CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" MAX_JOBS=80 USE_SYSTEM_NCCL=1 CUDA_HOME=/usr/local/cuda python setup.py install

Could anyone help me solve this issue?

Which GPU are you using? If it’s from the Ampere family, you might want to re-enable TF32 via torch.backends.cuda.matmul.allow_tf32 = True or via torch.set_float32_matmul_precision.
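A minimal sketch of that suggestion, in case it helps. The helper name enable_tf32_matmul is made up for illustration, and the "compute capability 8.0 or higher" check is my assumption for how to detect an Ampere or newer GPU. Setting the flag should be harmless on older GPUs or CPU-only machines, it just has no effect there.

```python
import torch


def enable_tf32_matmul() -> bool:
    """Allow TF32 in CUDA matmuls; return True if the GPU can actually use it.

    Hypothetical helper for illustration, not part of PyTorch.
    """
    # Re-enable TF32 for float32 matrix multiplications on CUDA.
    torch.backends.cuda.matmul.allow_tf32 = True

    # Alternative from the reply above (PyTorch 1.12+); "high" permits TF32:
    # torch.set_float32_matmul_precision("high")

    if not torch.cuda.is_available():
        return False

    # Assumption: Ampere and newer GPUs report compute capability >= 8.0.
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8


if __name__ == "__main__":
    supported = enable_tf32_matmul()
    print("TF32 allowed:", torch.backends.cuda.matmul.allow_tf32)
    print("GPU supports TF32:", supported)
```

Call it once at the start of the training script, before the model is built, so that every matmul in the run uses the same setting.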

Thanks a lot, problem solved.

But torch.set_float32_matmul_precision is not working.

BTW, is there a reason why the latest PyTorch version disables TF32 by default?

Yes, you can read more about it here.