`torch.nn.functional.conv2d` 8 times slower in torch 2.5.1 compared to 2.3.1

Hey everyone, I am noticing that the following snippet runs around 8 times slower in torch 2.5.1 than in torch 2.3.1, with everything else in my environment unchanged:

import torch
import torch.nn.functional as F
import time
print(torch.__version__)
# prints the following
# 2.5.1+cu124

print("torch.backends.cudnn.deterministic = ", torch.backends.cudnn.deterministic)
print("torch.backends.cuda.matmul.allow_tf32 = ", torch.backends.cuda.matmul.allow_tf32)
print("torch.backends.cudnn.allow_tf32 = ", torch.backends.cudnn.allow_tf32)
print("torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = ", torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction)
# prints the following
# torch.backends.cudnn.deterministic =  False
# torch.backends.cuda.matmul.allow_tf32 =  False
# torch.backends.cudnn.allow_tf32 =  True
# torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction =  True


torch.cuda.synchronize()
myconv_start = time.time()
with torch.no_grad():
    out = F.conv2d(torch.rand(75, 1, 572, 572).cuda(), torch.rand(1, 1, 51, 51).cuda(), padding=0).cpu()
torch.cuda.synchronize()
myconv_time = time.time() - myconv_start
print(f"myconv_time = {myconv_time:0.2f} seconds")
print(out.shape)
# prints the following
# myconv_time = 4.89 seconds
# torch.Size([75, 1, 522, 522])
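
For what it's worth, this timing also includes the torch.rand allocations, the host-to-device copies, the .cpu() copy back, and cuDNN's one-time algorithm selection on the first call. A minimal sketch (same shapes as above, purely for comparison) that warms up once and then times just the conv2d call would look like this:

import torch
import torch.nn.functional as F
import time

x = torch.rand(75, 1, 572, 572, device="cuda")
w = torch.rand(1, 1, 51, 51, device="cuda")

with torch.no_grad():
    F.conv2d(x, w, padding=0)  # warm-up: triggers cuDNN algorithm selection
torch.cuda.synchronize()

start = time.time()
with torch.no_grad():
    out = F.conv2d(x, w, padding=0)
torch.cuda.synchronize()
print(f"conv2d only: {time.time() - start:0.2f} seconds")

Timing it this way avoids attributing the transfer cost to the conv itself.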

My setup: driver version 550.90.07, CUDA 12.4, and an L4 GPU.

The exact same code runs in 0.65 seconds on average with torch 2.3.1+cu121 in the same environment.
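
To compare what the two installs are actually linked against, printing the CUDA and cuDNN versions in both environments is a quick check (the cuDNN number is integer-encoded, e.g. 90100 for 9.1.0):

import torch
print(torch.__version__)               # e.g. 2.5.1+cu124 vs 2.3.1+cu121
print(torch.version.cuda)              # CUDA version the wheel was built with
print(torch.backends.cudnn.version())  # integer-encoded cuDNN version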

Update: resolved it by upgrading to a newer cuDNN. Reference: pytorch/pytorch issue #146096 (https://github.com/pytorch/pytorch/issues/146096).
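
If you installed torch through pip, the bundled cuDNN should come from the `nvidia-cudnn-cu12` wheel, so upgrading that package (e.g. `pip install --upgrade nvidia-cudnn-cu12`) and re-running the `torch.backends.cudnn.version()` check above should confirm the newer library is picked up; the exact cuDNN version needed may differ per setup, so see the linked issue for details.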