CUDA 11.7 causing large calculation discrepancy

Was your PyTorch package also updated along with CUDA? I would check whether this is a consequence of TF32 being used in cuDNN (convolutions) but not in cuBLAS (matmuls). For example, you can run your linear implementation with:

torch.set_float32_matmul_precision("high")

to allow TF32 in matmuls and see if it changes the difference, or with

torch.backends.cudnn.allow_tf32 = False

to disable TF32 in cuDNN and see if that changes the difference.
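
If it helps, here is a minimal sketch of how the two checks above could be run side by side. I don't know what your actual model looks like, so the 1x1 convolution versus equivalent linear layer below is only a hypothetical stand-in, and the helper name conv_vs_linear_diff and the tensor shapes are made up for illustration. Note that TF32 only takes effect on Ampere or newer GPUs; on other devices (or on CPU) the flags should make no difference.

```python
import torch
import torch.nn.functional as F


def conv_vs_linear_diff(device: str) -> float:
    """Max abs difference between a 1x1 conv and the equivalent linear layer.

    Hypothetical stand-in for the real comparison; shapes are arbitrary.
    """
    torch.manual_seed(0)
    x = torch.randn(8, 64, 16, 16, device=device)  # (N, C_in, H, W)
    w = torch.randn(32, 64, device=device)          # (C_out, C_in)

    # Convolution path (goes through cuDNN on a GPU).
    y_conv = F.conv2d(x, w[:, :, None, None])

    # Same math as a matmul over the channel dimension (cuBLAS on a GPU).
    y_lin = F.linear(x.permute(0, 2, 3, 1), w).permute(0, 3, 1, 2)

    return (y_conv - y_lin).abs().max().item()


device = "cuda" if torch.cuda.is_available() else "cpu"

# Remember the current settings so they can be restored afterwards.
old_cudnn_tf32 = torch.backends.cudnn.allow_tf32
old_precision = torch.get_float32_matmul_precision()

results = {}
for cudnn_tf32 in (True, False):
    for precision in ("highest", "high"):
        torch.backends.cudnn.allow_tf32 = cudnn_tf32
        torch.set_float32_matmul_precision(precision)
        diff = conv_vs_linear_diff(device)
        results[(cudnn_tf32, precision)] = diff
        print(f"cudnn.allow_tf32={cudnn_tf32!s:5} matmul precision={precision:7} "
              f"max abs diff={diff:.3e}")

# Restore the original settings.
torch.backends.cudnn.allow_tf32 = old_cudnn_tf32
torch.set_float32_matmul_precision(old_precision)
```

If the difference is large only when exactly one of the two paths uses TF32, and shrinks when both settings agree, that would point to TF32 rather than to CUDA 11.7 itself.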