I am a bit confused about how the number of floating-point operations in a neural network is calculated. This is a somewhat well-established topic, but I need some clarification on the proper/precise way (if we can say so) of determining the FLOPs[1] of a Linear layer with and without the bias term. I have seen a similar post from quite some time ago, but it has no clear answer.
Let’s first establish the fundamentals:
- when passing some input `A` through an `nn.Linear` layer, we are basically performing GEMMs. Using the notation from the NVIDIA guidelines, the input matrix `A` has shape `(N, K)` (assuming here that `N` plays the role of the batch size and `K` is the feature dimension).
- Furthermore, the linear layer is described by its weight `W` of shape `(M, K)`, where `M` defines the output feature dimension.
- When the bias term is included in the linear layer, its shape simply matches the output dimension `M`.
Ignoring the bias term for now, the product `A W^T` between the input matrix `A` and the transposed weight matrix `W` amounts to `M x N x K` fused multiply-adds (FMAs for short). A single FMA consists of one multiplication and one addition, which results in two FLOPs per FMA[2].
Thus, for a single linear layer with no bias, there are 2 x M x N x K FLOPs. However, if there is a bias term, then how should one correctly estimate the total FLOPs?
My intuition tells me that we should simply count one extra addition per output element, since after computing `A W^T` one still needs to perform `A W^T + b`.
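To make the counting concrete, here is a small sketch of my intuition in plain Python (the function name `linear_flops` is just mine, and it assumes the FMA convention above, i.e. 2 FLOPs per multiply-accumulate):

```python
def linear_flops(N: int, K: int, M: int, bias: bool) -> int:
    """FLOPs of y = A @ W.T (+ b), with A: (N, K), W: (M, K), b: (M,).

    Each fused multiply-add counts as 2 FLOPs; the bias (if present)
    contributes one extra addition per element of the (N, M) output.
    """
    flops = 2 * M * N * K      # M*N*K FMAs for the matrix product
    if bias:
        flops += M * N         # one extra add per output element
    return flops

# With the toy shapes from the profiling experiment (N=5, K=10, M=1):
print(linear_flops(5, 10, 1, bias=False))  # 100
print(linear_flops(5, 10, 1, bias=True))   # 105
```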
I have been trying to test this using PyTorch's built-in profiler with a toy model (I denote the “Multiply-Add Accumulate” by MAC throughout the logs).
Code

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity


class Model(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, use_bias: bool):
        super(Model, self).__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=use_bias)

    def forward(self, x: torch.Tensor):
        x = self.linear(x)
        return x


def test_mac(use_bias: bool):
    print(f'Testing MAC for a Linear layer with bias={use_bias}')
    device = "cpu"
    N, K = 5, 10  # input shape: usually batch size and feature size
    M = 1         # output size
    A = torch.randn(N, K).to(device)
    model = Model(K, M, use_bias).to(device)
    print(f'A: (N, K) -> {A.shape}')
    print(f'W: (M, K) -> {model.linear.weight.shape}')
    if use_bias:
        print(f'b: (M,) -> {model.linear.bias.shape}')
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with record_function("model_inference"):
            model(A)
    print(prof.key_averages(group_by_input_shape=True).table(
        sort_by="cpu_time_total", row_limit=100))


test_mac(use_bias=False)
test_mac(use_bias=True)
```
First, without bias, we can see the breakdown:
```
Testing MAC for a Linear layer with bias=False
A: (N , K) -> torch.Size([5, 10])
W: (M, K) -> torch.Size([1, 10])
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------
Name                    Self CPU %    Self CPU      CPU total %   CPU total     CPU time avg  # of Calls    Input Shapes
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------
model_inference         42.15%        72.438us      100.00%       171.841us     171.841us     1             []
aten::linear            6.36%         10.921us      57.85%        99.403us      99.403us      1             [[5, 10], [1, 10], []]
aten::matmul            2.43%         4.168us       29.81%        51.222us      51.222us      1             [[5, 10], [10, 1]]
aten::mm                26.53%        45.594us      27.38%        47.054us      47.054us      1             [[5, 10], [10, 1]]
aten::t                 13.07%        22.465us      21.68%        37.260us      37.260us      1             [[1, 10]]
aten::transpose         5.36%         9.211us       8.61%         14.795us      14.795us      1             [[1, 10], [], []]
aten::as_strided        3.25%         5.584us       3.25%         5.584us       5.584us       1             [[1, 10], [], [], []]
aten::resolve_conj      0.75%         1.293us       0.75%         1.293us       1.293us       1             [[5, 1]]
aten::resolve_conj      0.10%         0.167us       0.10%         0.167us       0.167us       1             [[10, 1]]
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------
```
We can see that it ultimately performs torch.mm, the “non-broadcastable” version of matrix multiplication. On the other hand, when we include the bias, the dispatched operation is different:
```
Testing MAC for a Linear layer with bias=True
A: (N , K) -> torch.Size([5, 10])
W: (M, K) -> torch.Size([1, 10])
b: (M,) -> torch.Size([1])
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  -----------------------------------
Name                    Self CPU %    Self CPU      CPU total %   CPU total     CPU time avg  # of Calls    Input Shapes
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  -----------------------------------
model_inference         35.30%        74.750us      100.00%       211.750us     211.750us     1             []
aten::linear            5.53%         11.708us      64.70%        137.000us     137.000us     1             [[5, 10], [1, 10], [1]]
aten::addmm             31.39%        66.458us      40.00%        84.708us      84.708us      1             [[1], [5, 10], [10, 1], [], []]
aten::t                 9.54%         20.209us      19.17%        40.584us      40.584us      1             [[1, 10]]
aten::transpose         7.04%         14.917us      9.62%         20.375us      20.375us      1             [[1, 10], [], []]
aten::copy_             6.93%         14.667us      6.93%         14.667us      14.667us      1             [[5, 1], [5, 1], []]
aten::as_strided        2.58%         5.458us       2.58%         5.458us       5.458us       1             [[1, 10], [], [], []]
aten::expand            0.75%         1.583us       1.12%         2.375us       2.375us       1             [[1], [], []]
aten::resolve_conj      0.51%         1.083us       0.51%         1.083us       1.083us       1             [[5, 1]]
aten::as_strided        0.37%         0.792us       0.37%         0.792us       0.792us       1             [[1], [], [], []]
aten::resolve_conj      0.06%         0.125us       0.06%         0.125us       0.125us       1             [[10, 1]]
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  -----------------------------------
```
Indeed, instead of torch.mm, the torch.addmm operation is performed.
It is unclear to me whether addmm still amounts to the same `2 x M x N x K` FLOPs or not. Should it be `2 x (M + 1) x N x K`, or rather `2 x (M x N x K + M x N)`?
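As a sanity check on the arithmetic (not on what any particular BLAS kernel actually does internally), one can count the scalar operations of `y = A @ W.T + b` explicitly in plain Python, under the same FMA convention as above (the accumulate of each product counted as one addition):

```python
# Explicitly count scalar multiplications and additions in
# y = A @ W.T + b, for A: (N, K), W: (M, K), b: (M,).
N, K, M = 5, 10, 1

mults = adds = 0
for n in range(N):           # rows of A (batch dimension)
    for m in range(M):       # rows of W (output features)
        # dot product of A[n, :] and W[m, :]: K mults, K adds
        for _ in range(K):
            mults += 1
            adds += 1
        adds += 1            # one extra add for the bias term b[m]

total_flops = mults + adds
print(total_flops)           # prints 105
print(2 * M * N * K + M * N) # prints 105 as well
```

Under this counting, the bias contributes `M x N` single additions (one per output element, not one per FMA), so the total comes out as `2 x M x N x K + M x N` rather than `2 x (M + 1) x N x K`. I would be glad to have this confirmed.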
Thanks in advance!