Conv2d Much Faster than Mul on GPU. Why is this?

I am timing a 2D convolution against an element-wise multiplication in PyTorch. According to my timings, the conv2d operation is significantly faster than a simple element-wise multiplication.

Conv Time: 0.009611368179321289

Mult Time: 0.5085432529449463

This doesn’t make much sense to me, as convolution is O(N^2 k^2), where k is the filter size, versus O(N^2) for element-wise multiplication.

Is there some reason for this?
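
To put numbers on the complexity comparison, here is a rough operation count for the tensor sizes used in the script further down. It is only a sketch: it counts multiply-accumulates and ignores memory traffic, and the variable names are illustrative. Note that once channels are included, the gap is a factor of outC * k^2 rather than k^2. Operation counts alone do not predict GPU runtime, since an element-wise multiply is typically limited by memory bandwidth rather than arithmetic.

```python
# Rough operation counts for the tensor sizes used in the benchmark script.
batch, in_c, out_c = 100, 32, 32
img_h, img_w = 64, 64
k = 3  # filter size

# padding=1 with a 3x3 filter keeps the output at img_h x img_w.
out_h, out_w = img_h, img_w

# Each output element of the convolution sums over in_c * k * k products.
conv_macs = batch * out_c * out_h * out_w * in_c * k * k

# Element-wise multiplication does one product per input element.
mul_ops = batch * in_c * img_h * img_w

print(f"conv multiply-accumulates: {conv_macs:,}")
print(f"element-wise multiplies:   {mul_ops:,}")
print(f"ratio: {conv_macs // mul_ops}x")
```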

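Edit: It turns out I need to call torch.cuda.synchronize() before starting and before stopping the timer. CUDA kernels are launched asynchronously, so without synchronization the timer does not measure when the work actually finishes.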

After this the timings make sense.

Conv Time: 0.010584831237792969
Mult Time: 0.008290290832519531

import torch
import numpy as np
import time
from torch.nn import functional as F

torchType = torch.FloatTensor
inC = 32
imgH = 64
imgW = 64
outC = 32
filtSize = 3
imgs = np.random.normal(size=(100,inC,imgH,imgW))
imgs = torch.from_numpy(imgs).type(torch.float32).cuda()
# F.conv2d expects weights shaped (out_channels, in_channels, kH, kW)
filts = torch.randn((outC,inC,filtSize,filtSize)).cuda()

numIters = 500

# Warm up for GPU
for i in range(10):
    cLayer = F.conv2d(imgs, filts,bias=None,padding=1)
for i in range(10):
    c = torch.mul(imgs,imgs)

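# Actual Timing
# CUDA kernels launch asynchronously, so synchronize before reading the clock
torch.cuda.synchronize()
st = time.time()
for i in range(numIters):
    cLayer = F.conv2d(imgs, filts, bias=None, padding=1)
torch.cuda.synchronize()
et = time.time()

print("Conv Time: {}".format(et-st))

torch.cuda.synchronize()
st = time.time()
for i in range(numIters):
    c = torch.mul(imgs, imgs)
torch.cuda.synchronize()
et = time.time()

print("Mult Time: {}".format(et-st))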
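
The synchronization pattern from the edit can also be wrapped in a small helper so it is not forgotten for any individual measurement. This is a minimal sketch, not a definitive benchmark harness: `time_op` and `main` are hypothetical names introduced here, and the fallback to CPU when no GPU is present is an assumption added so the sketch runs anywhere.

```python
import time


def time_op(fn, iters, sync=None):
    """Return wall-clock seconds for `iters` calls of `fn`.

    `sync` is called before the timer starts and before it stops, so that
    asynchronously queued work is included in the measurement.
    """
    if sync is not None:
        sync()  # drain any work queued before the timed region
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for the timed work to actually finish
    return time.perf_counter() - start


def main():
    # Imported here so that time_op itself has no PyTorch dependency.
    import torch
    from torch.nn import functional as F

    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"
    sync = torch.cuda.synchronize if use_cuda else None

    imgs = torch.randn(100, 32, 64, 64, device=device)
    filts = torch.randn(32, 32, 3, 3, device=device)  # (outC, inC, kH, kW)

    def conv():
        return F.conv2d(imgs, filts, bias=None, padding=1)

    def mul():
        return torch.mul(imgs, imgs)

    # Warm up so one-time setup costs are not included in the timings.
    for _ in range(10):
        conv()
        mul()

    print("Conv Time: {}".format(time_op(conv, 50, sync)))
    print("Mult Time: {}".format(time_op(mul, 50, sync)))
```

Calling `main()` on a machine with PyTorch installed runs the comparison. Passing `torch.cuda.synchronize` as `sync` is what keeps the timer from stopping while kernels are still queued on the GPU.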