Was your PyTorch package also updated along with CUDA? I would check whether the mismatch is a consequence of TF32 being used via cuDNN but not via cuBLAS. For example, you could run your linear implementation with:
torch.set_float32_matmul_precision("high")
(which allows TF32 in cuBLAS float32 matmuls) and see if the difference changes, or run it with:
torch.backends.cudnn.allow_tf32 = False
(which disables TF32 in cuDNN) and check the difference again.
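To make the check concrete, here is a minimal sketch comparing a float32 matmul against a float64 reference under both matmul-precision modes. The shapes and seed are arbitrary assumptions, and on a CPU-only machine TF32 is never used, so both modes should then report the same error:

```python
import torch

# Arbitrary test shapes; substitute your own linear implementation / inputs.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
x = torch.randn(1024, 1024, device=device)
w = torch.randn(1024, 1024, device=device)

# High-precision reference computed in float64.
ref = (x.double() @ w.double()).float()

for mode in ("highest", "high"):  # "high" allows TF32 in cuBLAS on Ampere+ GPUs
    torch.set_float32_matmul_precision(mode)
    diff = (x @ w - ref).abs().max().item()
    print(f"matmul precision={mode}: max abs diff = {diff:.3e}")
```

If the error jumps by a few orders of magnitude in `"high"` mode (on an Ampere or newer GPU), TF32 in cuBLAS is the likely culprit; toggling `torch.backends.cudnn.allow_tf32` isolates the cuDNN side in the same way.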