Optimizing network architecture

Hi, I was benchmarking different convolutional layers on the CPU. It seems that fewer output channels do not necessarily mean a faster runtime. This is my test:

import timeit

import torch


def benchmark(model, inp_size=(1, 1, 28, 28), n=1000):
    model.eval()
    dummy_input = torch.rand(*inp_size)

    # Disable autograd so we time only the forward pass.
    with torch.no_grad():
        runtimes = timeit.repeat(lambda: model(dummy_input), repeat=10, number=n)
    print(model)
    print(min(runtimes))
    print()
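One thing to keep in mind about the numbers printed above: `timeit.repeat(func, repeat=10, number=n)` returns ten totals, where each total is the time for `n` calls. So `min(runtimes)` here is the best total over 1000 forward passes, not a per-call time. A minimal illustration of this, using a stand-in workload instead of a Conv2d forward pass:

```python
import timeit

# Stand-in workload instead of a real model forward pass.
def work():
    sum(i * i for i in range(1000))

n = 100
# Returns 5 totals; each total is the wall time for `n` calls of `work`.
runtimes = timeit.repeat(work, repeat=5, number=n)
per_call = min(runtimes) / n  # divide by `number` to get seconds per call
print(f"best total for {n} calls: {min(runtimes):.6f}s, per call: {per_call:.8f}s")
```

Taking the minimum over repeats (rather than the mean) is the usual choice, since the fastest run is the least disturbed by other system activity.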

The result:

Conv2d(1, 64, kernel_size=(7, 7), stride=(1, 1), padding=(1, 1))
0.13669140800000035

Conv2d(1, 61, kernel_size=(7, 7), stride=(1, 1), padding=(1, 1))
0.19433658299999967

Conv2d(1, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
0.1496824219999997

Conv2d(1, 125, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
0.2669210919999996

Conv2d(1, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
0.501409099

Conv2d(1, 509, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1.0796660310000021

Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
0.666838663

Conv2d(61, 61, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
0.8703990200000007

Conv2d(33, 33, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
0.5053342729999954

Questions:

  1. Is my benchmarking correct?
  2. Why is using fewer channels not faster?
  3. Would a network compiler like Glow accelerate layers with fewer channels?