Implementing low rank approximation

I was implementing low-rank approximation for reducing the number of filters (https://arxiv.org/pdf/1405.3866.pdf).

Below is some simple code to test the performance.

Part 1

import time

import torch

img = torch.rand(1, 3, 416, 416)
filters = torch.rand(64, 3, 5, 5)

avg = 0
for x in range(10):
    num_ops = 0
    # time.clock() was removed in Python 3.8; perf_counter() measures wall-clock time
    st = time.perf_counter()
    conv = torch.nn.functional.conv2d(img, filters, padding=1)
    num_ops = conv.numel() * filters.numel()
    # print(conv.size())
    avg += time.perf_counter() - st
print(avg / 10.0, 'Average conv operation time')
print('Number of operation', num_ops)

Part 2

print("=================================================================")
filters = torch.rand(8, 3, 5, 5)  # 8 basis filters instead of 64
A = torch.rand(64, 8)             # mixes the 8 basis responses into 64 outputs
avg = 0
core_avg = 0
for x in range(10):
    num_ops = 0
    st = time.perf_counter()

    # step 1: convolve with the 8 basis filters
    core_st = time.perf_counter()
    conv = torch.nn.functional.conv2d(img, filters, padding=1)
    num_ops = conv.numel() * filters.numel()
    core_avg += time.perf_counter() - core_st

    # step 2: combine the basis responses with a matrix multiplication
    conv = conv.view(8, -1)
    core_st = time.perf_counter()
    num_ops += conv.numel() * A.size(0)
    conv = torch.mm(A, conv)
    core_avg += time.perf_counter() - core_st

    # print(conv.view(1, 64, 414, 414).size())
    avg += time.perf_counter() - st
print('Number of reduced operation', num_ops)
print(avg / 10.0, 'Average reduced operation time')
print(core_avg / 10.0, 'Average reduced core operation time')

Output
0.0859299 Average conv operation time
Number of operation 52652851200
=================================================================
Number of reduced operation 910455552
0.0904023 Average reduced operation time
0.0897171 Average reduced core operation time

Part 1 does the plain convolution and Part 2 does the low-rank approximation.
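
As a sanity check on what Part 2 computes: convolution is linear in the filters, so mixing the 8 basis responses with A should give the same result as convolving once with the mixed filters (A times the basis filters). Here is a minimal pure-Python sketch of that identity on a tiny 1-D signal. The helper names `correlate1d` and `mix` are made up for illustration, and the sizes are far smaller than in the benchmark.

```python
import random

def correlate1d(signal, kernel):
    """Valid cross-correlation of a 1-D signal with a 1-D kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def mix(A, rows):
    """Matrix product A @ rows, where rows is a list of equal-length lists."""
    return [[sum(a * row[c] for a, row in zip(A_row, rows))
             for c in range(len(rows[0]))]
            for A_row in A]

random.seed(0)
signal = [random.random() for _ in range(32)]
basis = [[random.random() for _ in range(5)] for _ in range(2)]  # 2 basis filters
A = [[random.random() for _ in range(2)] for _ in range(6)]      # 6 output filters

# Low-rank path: filter with the basis, then mix the responses with A.
low_rank = mix(A, [correlate1d(signal, f) for f in basis])

# Direct path: build the 6 full filters (A @ basis), then filter with each.
full_filters = mix(A, basis)
direct = [correlate1d(signal, f) for f in full_filters]

# The two paths should differ only by floating-point rounding.
max_diff = max(abs(x - y)
               for r1, r2 in zip(low_rank, direct)
               for x, y in zip(r1, r2))
print(max_diff)
```

The same reasoning applies to the 2-D case in the benchmark, with `filters` as the basis and `A` as the 64 x 8 mixing matrix.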

I see that the number of operations is lower for the low-rank approximation, but there is no reduction in the time taken. What is wrong with my implementation?
Is conv2d better optimized than matrix multiplication?

The number of operations you computed is for the naive convolution algorithm, but there’s been a lot of research in this area, and modern algorithms perform far fewer operations. Additionally, the actual speed depends not only on floating-point operations, but also on memory bandwidth. Doing conv + mm requires reloading some intermediate values multiple times, and this takes time. A single conv kernel can reuse values already loaded into registers.
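
To put rough numbers on this, here is a back-of-the-envelope sketch for the shapes used in the question. It counts multiply-accumulates (MACs) for a direct stride-1 convolution as output elements times C_in * kH * kW. The figures printed in the question multiply by the size of the whole filter tensor instead, so the output-channel count enters twice. The sketch also estimates the extra memory traffic caused by the intermediate tensor, assuming 4-byte floats. The helper names `conv_out_hw` and `conv_macs` are made up for illustration, and the traffic estimate ignores caches entirely, so treat it as an illustration rather than a measurement.

```python
def conv_out_hw(size, kernel, padding):
    """Output height/width of a stride-1 convolution."""
    return size + 2 * padding - kernel + 1

def conv_macs(c_out, c_in, k, out_hw):
    """MACs of a direct (naive) stride-1 convolution with batch size 1."""
    return c_out * out_hw * out_hw * c_in * k * k

out_hw = conv_out_hw(416, 5, 1)      # 414 for the shapes in the question

full = conv_macs(64, 3, 5, out_hw)   # Part 1: 64 full filters
basis = conv_macs(8, 3, 5, out_hw)   # Part 2: 8 basis filters
mixing = 64 * 8 * out_hw * out_hw    # Part 2: torch.mm(A, conv)
low_rank = basis + mixing

# The intermediate 8-channel tensor is written by conv2d and read back by mm.
bytes_per_float = 4                  # assumption: float32
intermediate_bytes = 2 * 8 * out_hw * out_hw * bytes_per_float

print('full conv MACs    ', full)
print('low-rank MACs     ', low_rank)
print('MAC ratio         ', full / low_rank)
print('extra traffic (MB)', intermediate_bytes / 1e6)
```

Under these assumptions the direct count for the factorized version is about 4.3 times smaller, not the roughly 58 times suggested by the printed figures, and the second step adds a write and a read of the intermediate tensor on top of that.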