Thanks so much for your reply!
From my testing, setting torch.backends.cudnn.benchmark = True is the key to accelerating conv2d with one group and a large kernel size. Only the first run of conv2d takes a long time; the following runs are all very fast:
Time taken for conv2d with groups=1, round=0: 13.8164 seconds
Time taken for conv2d with groups=1, round=1: 0.0100 seconds
Time taken for conv2d with groups=1, round=2: 0.0090 seconds
Time taken for conv2d with groups=1, round=3: 0.0090 seconds
Time taken for conv2d with groups=1, round=4: 0.0090 seconds
Time taken for conv2d with groups=1, round=5: 0.0090 seconds
Time taken for conv2d with groups=1, round=6: 0.0090 seconds
Time taken for conv2d with groups=1, round=7: 0.0090 seconds
Time taken for conv2d with groups=1, round=8: 0.0090 seconds
Time taken for conv2d with groups=1, round=9: 0.0090 seconds
And for conv2d with two groups, the default cuDNN algorithm already seems fast enough:
Time taken for conv2d with groups=2, round=0: 0.0200 seconds
Time taken for conv2d with groups=2, round=1: 0.0190 seconds
Time taken for conv2d with groups=2, round=2: 0.0180 seconds
Time taken for conv2d with groups=2, round=3: 0.0190 seconds
Time taken for conv2d with groups=2, round=4: 0.0180 seconds
Time taken for conv2d with groups=2, round=5: 0.0190 seconds
Time taken for conv2d with groups=2, round=6: 0.0180 seconds
Time taken for conv2d with groups=2, round=7: 0.0180 seconds
Time taken for conv2d with groups=2, round=8: 0.0190 seconds
Time taken for conv2d with groups=2, round=9: 0.0180 seconds
May I ask whether it is possible to make PyTorch remember the best cuDNN algorithm for the one-group case? I need to run many tests of this kind of conv2d with a large kernel, but each time I only need a single run.
I can somewhat achieve this by duplicating the kernel and using two groups, but that does not seem elegant. Thank you!
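For reference, the duplicate-kernel workaround I mean looks roughly like this. This is just a sketch: it repeats the single-channel input and kernel along the channel/output dimensions so that the convolution runs with groups=2 (which was fast without the benchmark warm-up in my tests), then keeps only the first of the two identical output channels. The function name conv2d_duplicated is mine, not from any library.

```python
import torch
import torch.nn.functional as F


def conv2d_duplicated(input_field, kernel):
    """Run a 1-in/1-out conv2d as a 2-group convolution (illustrative workaround).

    input_field: (1, 1, H, W), kernel: (1, 1, kH, kW)
    """
    inp2 = input_field.repeat(1, 2, 1, 1)   # (1, 2, H, W): duplicate the input channel
    ker2 = kernel.repeat(2, 1, 1, 1)        # (2, 1, kH, kW): duplicate the kernel
    out = F.conv2d(inp2, ker2, padding='same', groups=2)
    return out[:, :1]                        # both output channels are identical; keep one
```

The result matches the plain groups=1 convolution, at the cost of doing the same work twice.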
The code I used for the test above; I did two separate runs, one with n_frames=1 and one with n_frames=2:
import torch
from time import time

torch.backends.cudnn.benchmark = True
device = 'cuda:1'
torch.manual_seed(0)

input_field = torch.rand(1, 1, 1080, 1920, dtype=torch.float32, device=device)
kernel = torch.rand(1, 1, 101, 101, dtype=torch.float32, device=device)

n_frames = 1
input_field = input_field.repeat(1, n_frames, 1, 1)  # Repeat the input field for n_frames
kernel = kernel.repeat(n_frames, 1, 1, 1)  # Repeat the sub-hologram phase for n_frames

for i in range(10):
    torch.cuda.synchronize(device=device)
    start_time = time()
    ret = torch.nn.functional.conv2d(
        input_field,
        kernel,
        padding='same',
        groups=n_frames,
    )
    torch.cuda.synchronize(device=device)
    end_time = time()
    print(f"Time taken for conv2d with groups={n_frames}, round={i}: {end_time - start_time:.4f} seconds")