Speed drop with dilated conv on different GPUs

Hi, I’m seeing a huge speed difference when running the exact same code on Pascal cards (like a 1080, Titan XP, or P100) versus more recent cards (Titan V or 2080 Ti). Both setups run PyTorch 1.5 with CUDA 10.2 and the cudnn 7.6.5 shipped with PyTorch.

After digging into the code, I found that the problem comes from the dilated convolutions in my network.

I’m running a task similar to image segmentation, with a 2D ResNet as my network. For comparison, I turned off shuffling in my dataloaders, so the input sizes are identical across all my benchmarks.
When I set all dilations to 1 throughout the network, the speed is quite similar across cards: around 30~35 seconds for 100 iterations on both the Pascal cards and the newer ones.

However, when I turn dilation on, the speed drops significantly on the Pascal cards, to around 140 seconds for 100 iterations, while on the Titan V it’s still around 40 seconds for 100 iterations.

In short, speed is similar with dilation off, but the Pascal cards are 3 to 4 times slower with dilation on.
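For anyone who wants to reproduce the effect, here is a minimal timing sketch (my assumption of a representative block, not my actual network) that compares a plain vs. a dilated conv on whatever device is available:

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

def time_conv(dilation, iters=100):
    # Hypothetical stand-in for one conv block of the real network.
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=dilation,
                     dilation=dilation).to(device)
    x = torch.randn(1, 64, 128, 128, device=device)
    # Warm-up so one-time setup (algorithm selection, allocation) is excluded.
    for _ in range(10):
        conv(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        conv(x)
    if device == "cuda":
        # CUDA launches are async; sync before reading the clock.
        torch.cuda.synchronize()
    return time.perf_counter() - start

print(f"dilation=1: {time_conv(1):.3f}s  dilation=2: {time_conv(2):.3f}s")
```

The `torch.cuda.synchronize()` calls matter: without them the loop only measures kernel launch time, not execution time.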

I found a similar topic here discussing the cudnn backend.

However, since my input size changes, setting torch.backends.cudnn.benchmark to True only slows things down further. Any help here?

Could you disable cudnn via torch.backends.cudnn.enabled = False and rerun the check?
If the performance numbers are closer to each other, then most likely cudnn’s default algorithm for the Pascal architecture is slower than for Volta.
How different are your input shapes? Are you getting new shapes in each iteration or only in the first couple of iterations?
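For reference, the suggested check can be set up like this (a minimal sketch; the conv shapes are my own placeholders, not the poster’s network):

```python
import torch
import torch.nn as nn

# Disable cudnn globally; convolutions fall back to PyTorch's native kernels.
torch.backends.cudnn.enabled = False

device = "cuda" if torch.cuda.is_available() else "cpu"
conv = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2).to(device)
x = torch.randn(1, 16, 64, 64, device=device)
out = conv(x)  # padding=dilation keeps the spatial size: 64 + 2*2 - 5 + 1 = 64
print(out.shape)  # torch.Size([1, 16, 64, 64])
```

The flag has to be set before the benchmarked forward passes run; it applies process-wide.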

Hi, thanks for your help! I disabled cudnn by setting torch.backends.cudnn.enabled to False in my experiments.
I also used a shallower ResNet here, because disabling the flag seems to raise the memory usage.
Now the speeds are quite similar on the 1080 Tis and RTX 2080 Tis, both around 24~30s for 100 iterations.

My task is similar to image segmentation: the input is an L×20 tensor (20 is the feature dimension) and the output is an L×L 0/1 segmentation map. L ranges from 50 up to 400. Batch size is set to 1, and I’m getting new shapes in every iteration. Resizing the input to a fixed L doesn’t seem reasonable for my task.
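One possible middle ground (my suggestion, not something established in this thread, and it assumes your model and loss can tolerate padded rows, e.g. via masking): pad L up to a small set of bucket sizes instead of one fixed L. That bounds the number of distinct input shapes, which is exactly what cudnn.benchmark needs to amortize its algorithm search.

```python
import math
import torch
import torch.nn.functional as F

def pad_to_bucket(x, bucket=50):
    """Pad the length dimension of an (L, 20) tensor up to the next
    multiple of `bucket`. With L in 50..400 this yields at most 8
    distinct shapes, so cudnn.benchmark's per-shape search runs only
    a handful of times instead of on every iteration."""
    L = x.shape[0]
    target = math.ceil(L / bucket) * bucket
    # F.pad pads last dim first: (left, right) for dim 1, (top, bottom) for dim 0.
    return F.pad(x, (0, 0, 0, target - L))

x = torch.randn(137, 20)
print(pad_to_bucket(x).shape)  # torch.Size([150, 20])
```

The `bucket` size is a tunable trade-off: larger buckets mean fewer distinct shapes but more wasted computation on padding.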