Comparing time complexity of convolutional vs. fully connected models

Hi,

I designed the two models below. The convolutional model (model 2) has a much smaller number of parameters, but in terms of latency it is slower than the fully connected model (model 1). What is the reason for this? If a model has fewer FLOPs, shouldn't its latency be lower?

How can I show that, because the convolutional model has far fewer FLOPs, its latency (time complexity) should also be lower?

In my measurements both models have roughly equal inference latency. Is that reasonable?

Model 1 (fully connected model):
import torch
import torch.nn as nn


class modelss(nn.Module):

    def __init__(self):
        super(modelss, self).__init__()
        self.channel_num_in = 256
        self.encoder = nn.Sequential(
            nn.Linear(self.channel_num_in, 512),
            nn.Linear(512, 512),
            nn.Linear(512, 256),
            nn.Linear(256, 128),
            nn.Linear(128, 64),
            nn.Linear(64, 32),
            nn.Linear(32, 16),
        )
        self.fc = nn.Linear(16, 7)

    def forward(self, x):
        x = self.encoder(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x


model = modelss()


Layer (type)    Output Shape    Params     FLOPs      Madds
============================================================
Linear-1        [2, 512]        131,584    131,072    261,632
Linear-2        [2, 512]        262,656    262,144    523,776
Linear-3        [2, 256]        131,328    131,072    261,888
Linear-4        [2, 128]         32,896     32,768     65,408
Linear-5        [2, 64]           8,256      8,192     16,320
Linear-6        [2, 32]           2,080      2,048      4,064
Linear-7        [2, 16]             528        512      1,008
Linear-8        [2, 7]              119        112        217
============================================================
Total params: 569,447
Trainable params: 569,447
Non-trainable params: 0
Total FLOPs: 567,920
Total Madds: 1,134,313

Input size (MB): 0.03
Forward/backward pass size (MB): 0.01
Params size (MB): 0.54
Estimated Total Size (MB): 0.58
FLOPs size (GB): 0.00
Madds size (GB): 0.00
============================================================

Model 2 (conv model):

class ConvModel(nn.Module):

    def __init__(self):
        super(ConvModel, self).__init__()
        self.channel_num_in = 1
        self.layer = nn.Sequential(
            # Conv2d(in_channels, out_channels, kernel_size=2, stride=2)
            nn.Conv2d(self.channel_num_in, 16, 2, 2),
            nn.Conv2d(16, 8, 2, 2),
            nn.Conv2d(8, 4, 2, 2),
        )
        self.fc = nn.Linear(16, 7)

    def forward(self, x):
        x = self.layer(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x


model = ConvModel()


Layer (type)    Output Shape      Params    FLOPs     Madds
============================================================
Conv2d-1        [2, 16, 8, 8]         80     5,120     8,192
Conv2d-2        [2, 8, 4, 4]         520     8,320    16,384
Conv2d-3        [2, 4, 2, 2]         132       528     1,024
Linear-4        [2, 7]               119       112       217
============================================================
Total params: 851
Trainable params: 851
Non-trainable params: 0
Total FLOPs: 14,080
Total Madds: 25,817

Input size (MB): 0.03
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.04
FLOPs size (GB): 0.00
Madds size (GB): 0.00
============================================================
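As a sanity check on the numbers above, the per-layer FLOPs can be reproduced by hand: a Linear layer costs in_features × out_features multiply-accumulates per sample, and a Conv2d layer k·k·C_in·C_out·H_out·W_out, plus the bias adds, which the summary tool appears to count only for the conv layers. A minimal sketch using the layer sizes of the two models above (the helper names are just illustrative):

# Reproduce the per-sample FLOP counts reported in the two summaries above.
def linear_flops(c_in, c_out):
    # one multiply-accumulate per weight; bias not counted, as in the tables
    return c_in * c_out

def conv2d_flops(c_in, c_out, k, h_out, w_out):
    macs = k * k * c_in * c_out * h_out * w_out
    bias = c_out * h_out * w_out          # bias adds, counted for conv layers
    return macs + bias

fc_total = sum(linear_flops(i, o) for i, o in
               [(256, 512), (512, 512), (512, 256), (256, 128),
                (128, 64), (64, 32), (32, 16), (16, 7)])

conv_total = (conv2d_flops(1, 16, 2, 8, 8)    # 5,120
              + conv2d_flops(16, 8, 2, 4, 4)  # 8,320
              + conv2d_flops(8, 4, 2, 2, 2)   # 528
              + linear_flops(16, 7))          # 112

print(fc_total)    # 567,920 -> "Total FLOPs" of the fully connected model
print(conv_total)  # 14,080  -> "Total FLOPs" of the conv model

So the roughly 40× gap in FLOPs between the two models (567,920 vs. 14,080) is real, independent of how latency is measured.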

This is the code I used to measure inference time for the models:

import numpy as np
import torch

# model = EfficientNet.from_pretrained('efficientnet-b0')
device = torch.device("cuda")
model.to(device)

# Note: this input shape matches the conv model; the fully connected model
# expects the 256 input features flattened, e.g. dummy_input.view(128, 256).
dummy_input = torch.randn(128, 1, 16, 16, dtype=torch.float).to(device)

starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
repetitions = 300
timings = np.zeros((repetitions, 1))

# GPU warm-up
for _ in range(10):
    _ = model(dummy_input)

# Measure performance
with torch.no_grad():
    for rep in range(repetitions):
        starter.record()
        _ = model(dummy_input)
        ender.record()
        # wait for GPU sync
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender)
        timings[rep] = curr_time

mean_syn = np.sum(timings) / repetitions
std_syn = np.std(timings)
print(mean_syn)
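For a side-by-side comparison, the same measurement can be wrapped in a small helper and run on both models with the same data, so both means come from an identical procedure. This is just a sketch (measure_latency is an illustrative name, not an existing API); it assumes the two model classes defined above and reshapes the input for the fully connected model:

import numpy as np
import torch

def measure_latency(model, dummy_input, repetitions=300, warmup=10):
    # Same procedure as above: CUDA events around each forward pass,
    # with an explicit synchronize before reading the elapsed time.
    starter = torch.cuda.Event(enable_timing=True)
    ender = torch.cuda.Event(enable_timing=True)
    timings = np.zeros(repetitions)
    with torch.no_grad():
        for _ in range(warmup):               # GPU warm-up
            _ = model(dummy_input)
        for rep in range(repetitions):
            starter.record()
            _ = model(dummy_input)
            ender.record()
            torch.cuda.synchronize()          # wait for the GPU to finish
            timings[rep] = starter.elapsed_time(ender)  # milliseconds
    return timings.mean(), timings.std()

device = torch.device("cuda")
dummy_input = torch.randn(128, 1, 16, 16, device=device)

# The fully connected model expects a flat 256-dim input,
# the conv model a 1x16x16 image, so the same tensor is viewed accordingly.
fc_mean, fc_std = measure_latency(modelss().to(device).eval(),
                                  dummy_input.view(128, 256))
conv_mean, conv_std = measure_latency(ConvModel().to(device).eval(),
                                      dummy_input)
print(fc_mean, conv_mean)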