Actually, since my input size is fixed, and based on the thread "What does torch.backends.cudnn.benchmark do?", I am using
torch.backends.cudnn.benchmark = True
I tried both, and benchmark = True gives very slightly faster times than
torch.backends.cudnn.deterministic = True
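
For anyone who wants to repeat the comparison, here is a rough sketch of how the two settings could be timed. The convolution layer, the input shape, and the helper name `time_forward` are placeholders picked for illustration, not the actual setup, and the script falls back to CPU (where the cuDNN flags have no effect) if no GPU is available.

```python
import time

import torch
import torch.nn as nn


def time_forward(benchmark, deterministic, iters=20, warmup=5):
    """Return mean seconds per forward pass under the given cuDNN flags."""
    torch.backends.cudnn.benchmark = benchmark
    torch.backends.cudnn.deterministic = deterministic

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Hypothetical model and fixed input shape, for illustration only.
    model = nn.Conv2d(3, 16, kernel_size=3, padding=1).to(device).eval()
    x = torch.randn(8, 3, 64, 64, device=device)

    with torch.no_grad():
        # Warm-up: in benchmark mode, the algorithm search happens here.
        for _ in range(warmup):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()  # GPU kernels run asynchronously

        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()

    return (time.perf_counter() - start) / iters


t_bench = time_forward(benchmark=True, deterministic=False)
t_det = time_forward(benchmark=False, deterministic=True)
print(f"benchmark=True:     {t_bench * 1e3:.3f} ms per forward pass")
print(f"deterministic=True: {t_det * 1e3:.3f} ms per forward pass")
```

The warm-up iterations matter for a fair comparison, since with benchmark = True the first calls for a given input shape include the time spent trying out algorithms.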