SqueezeExcitation module design choice

I noticed that torchvision uses Conv2d layers (with 1x1 kernels) in its SqueezeExcitation module. However, an equivalent implementation with Linear layers appears to be noticeably faster than the Conv2d version. Is there a reason torchvision chose Conv2d, or could this be improved? Below is the code I used to time the two.

import torch 
import torch.nn as nn
import time
from torch.backends import cudnn

cudnn.benchmark = True

@torch.no_grad()
def main():
    data = torch.randn(64, 512, 1, 1).float().to("cuda")  # simulate globally pooled features of shape (N, C, 1, 1)
    shape = data.shape
    m1 = nn.Sequential(
        nn.Linear(512, 64),
        nn.ReLU(True),
        nn.Linear(64, 512),
        nn.Sigmoid()
    ).float().to("cuda")
    m2 = nn.Sequential(
        nn.Conv2d(512, 64, 1),
        nn.ReLU(True),
        nn.Conv2d(64, 512, 1),
        nn.Sigmoid()
    ).float().to("cuda")
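    # copy the 1x1 conv weights and biases into the linear layers so both modules compute the same function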
    m1[0].weight.data = m2[0].weight.data.squeeze()
    m1[0].bias.data = m2[0].bias.data
    m1[2].weight.data = m2[2].weight.data.squeeze()
    m1[2].bias.data = m2[2].bias.data
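    # sanity check: both paths should produce (numerically) identical outputs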
    res1 = data.squeeze()
    res1 = m1(res1)
    res1 = res1.reshape(shape)
    res2 = m2(data)
    print(torch.abs(res1 - res2).sum())

    for _ in range(2):  # warmup
        data = data.squeeze()
        data = m1(data)
        data = data.reshape(shape)
    torch.cuda.synchronize()
    t = time.perf_counter()
    for _ in range(10000):
        data = data.squeeze()
        data = m1(data)
        data = data.reshape(shape)
    torch.cuda.synchronize()
    print(time.perf_counter() - t)

    for _ in range(2):  # warmup
        data = m2(data)
    torch.cuda.synchronize()
    t = time.perf_counter()
    for _ in range(10000):
        data = m2(data)
    torch.cuda.synchronize()
    print(time.perf_counter() - t)

if __name__ == "__main__":
    main()

output:

tensor(0.0007, device='cuda:0') # total absolute difference, due to floating-point rounding
2.3467748999828473
3.0033348000142723

As shown above, Linear can be much faster.
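For reference, a minimal sketch of what a Linear-based SE block could look like (the class name LinearSqueezeExcitation is my own, not torchvision's module, which uses 1x1 Conv2d layers as discussed above):

import torch
import torch.nn as nn

class LinearSqueezeExcitation(nn.Module):
    # hypothetical Linear-based SE block, not the torchvision implementation
    def __init__(self, in_channels, squeeze_channels):
        super().__init__()
        self.fc1 = nn.Linear(in_channels, squeeze_channels)
        self.fc2 = nn.Linear(squeeze_channels, in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # squeeze: global average pool to (N, C)
        scale = x.mean(dim=(2, 3))
        # excitation: bottleneck MLP producing per-channel weights in (0, 1)
        scale = self.sigmoid(self.fc2(self.relu(self.fc1(scale))))
        # rescale the input channel-wise
        return x * scale[:, :, None, None]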

I don’t see a reason why a conv should be slower (in cuDNN with its benchmark mode enabled), as it should also be smart enough to call a matmul internally (if that would give a faster kernel).
I’ll check whether the observed slowdown is due to a wrong kernel selection or a lack of proper kernels.
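One way to inspect the kernel selection is the PyTorch profiler; a minimal sketch, reusing m2 and data from the script above:

from torch.profiler import profile, ProfilerActivity

# record the CUDA kernels launched by the Conv2d variant
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        m2(data)
    torch.cuda.synchronize()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))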


Actually, cudnn.benchmark already improves the Conv2d speed quite a lot: before enabling it, the Conv2d variant takes about 3.9 seconds. I observed similar time differences on two different machines with different CUDA and PyTorch versions, so it is not by chance.

Yeah, I was able to confirm the different kernel selection in the default (legacy) cuDNN API.
If you are using the nightly binaries with a new cuDNN version (I would recommend checking the nightly binary with CUDA 11.6), you can set this runtime environment variable to enable the cuDNN v8 API: TORCH_CUDNN_V8_API_ENABLED=1.
Using it, I’m able to close the performance gap almost entirely and am seeing similar kernels in both workloads.
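For reference, one way to set it is from Python before importing torch (exporting it in the shell before launching the script works as well); a minimal sketch:

import os

# opt into the cuDNN v8 API; to be safe, set it before importing torch
# so the flag is visible before cuDNN is initialized
os.environ["TORCH_CUDNN_V8_API_ENABLED"] = "1"

import torch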

Thanks, I guess I will wait until everything is stable and officially released.
