Should the bias in a Linear layer be considered when estimating FLOP?

I am a bit confused about how the calculation of floating point operations within a neural network is done. This is a somewhat “well-established” topic, but I need some clarification on the proper/precise way – if we can say so – of determining the FLOP[1] count of a Linear layer with and without the bias term. I have seen a similar post from quite some time ago, but it has no clear answer.

Let’s first establish the fundamentals:

  • when passing some input A through a nn.Linear layer, we are basically performing GEMMs. If one assumes the notation from the NVIDIA guidelines, the input matrix A can have a shape (N, K) (assuming here that N plays the role of the batch size and K is the feature dimension).
  • Furthermore, the linear layer will be described by its weight W of shape (M, K) where M defines the output feature dimension.
  • When the bias term is included in the linear layer, its shape will just match the output dimension M.

Ignoring the bias term for now, the product AW^T between the weight matrix W and the input matrix A amounts to M x N x K fused multiply-adds (or FMAs for short). Finally, a single FMA consists of one multiplication and one addition, which results in a total of two FLOPs per FMA[2].

Thus, for a single linear layer with no bias, there are 2 x M x N x K FLOPs. However, if there is a bias term, then how should one correctly estimate the total FLOPs?
My intuition tells me that we should simply consider one extra addition operation, since after AW^T, one needs to perform AW^T + b.
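To make the counting concrete, here is a small helper (my own sketch, not an official API) that applies the 2 x M x N x K rule, plus the extra M x N additions my intuition suggests for the bias:

```python
def linear_flops(n: int, k: int, m: int, bias: bool) -> int:
    """FLOPs for y = A @ W.T (+ b), with A: (n, k), W: (m, k), b: (m,).

    Each output element needs k FMAs = 2k FLOPs; the bias, if present,
    adds one extra addition per element of the (n, m) output.
    """
    flops = 2 * m * n * k  # GEMM: M x N x K FMAs, 2 FLOPs each
    if bias:
        flops += m * n     # one addition per output element
    return flops

# Toy configuration used in the profiling script: N=5, K=10, M=1
print(linear_flops(5, 10, 1, bias=False))  # 100
print(linear_flops(5, 10, 1, bias=True))   # 105
```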

I have been trying to test this using PyTorch built-in profiling with a toy model (I have denoted the “Multiply-Add Accumulate” with MAC throughout the logs).

Code

import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity


class Model(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, use_bias: bool):
        super(Model, self).__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=use_bias)

    def forward(self, x: torch.Tensor):
        x = self.linear(x)
        return x


def test_mac(use_bias: bool):
    print(f'Testing MAC for a Linear layer with bias={use_bias}')
    device = "cpu"

    N, K = 5, 10  # input shape: usually batch size and feature size
    M = 1  # output size
    A = torch.randn(N, K).to(device)
    model = Model(K, M, use_bias).to(device)

    print(f'A: (N , K) -> {A.shape}')
    print(f'W: (M, K) -> {model.linear.weight.shape}')
    if use_bias:
        print(f'b: (M,) -> {model.linear.bias.shape}')

    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with record_function("model_inference"):
            model(A)

    print(prof.key_averages(group_by_input_shape=True).table(
        sort_by="cpu_time_total", row_limit=100))

First, without bias, we can see the breakdown:

Testing MAC for a Linear layer with bias=False
A: (N , K) -> torch.Size([5, 10])
W: (M, K) -> torch.Size([1, 10])
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------  
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                Input Shapes  
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------  
       model_inference        42.15%      72.438us       100.00%     171.841us     171.841us             1                          []  
          aten::linear         6.36%      10.921us        57.85%      99.403us      99.403us             1      [[5, 10], [1, 10], []]  
          aten::matmul         2.43%       4.168us        29.81%      51.222us      51.222us             1          [[5, 10], [10, 1]]  
              aten::mm        26.53%      45.594us        27.38%      47.054us      47.054us             1          [[5, 10], [10, 1]]  
               aten::t        13.07%      22.465us        21.68%      37.260us      37.260us             1                   [[1, 10]]  
       aten::transpose         5.36%       9.211us         8.61%      14.795us      14.795us             1           [[1, 10], [], []]  
      aten::as_strided         3.25%       5.584us         3.25%       5.584us       5.584us             1       [[1, 10], [], [], []]  
    aten::resolve_conj         0.75%       1.293us         0.75%       1.293us       1.293us             1                    [[5, 1]]  
    aten::resolve_conj         0.10%       0.167us         0.10%       0.167us       0.167us             1                   [[10, 1]]  
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------  

which ultimately performs the “non-broadcastable” version of matrix multiplication, torch.mm. On the other hand, when we include the bias, the operation is different:

Testing MAC for a Linear layer with bias=True
A: (N , K) -> torch.Size([5, 10])
W: (M, K) -> torch.Size([1, 10])
b: (M,) -> torch.Size([1])
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  -----------------------------------  
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                         Input Shapes  
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  -----------------------------------  
       model_inference        35.30%      74.750us       100.00%     211.750us     211.750us             1                                   []  
          aten::linear         5.53%      11.708us        64.70%     137.000us     137.000us             1              [[5, 10], [1, 10], [1]]  
           aten::addmm        31.39%      66.458us        40.00%      84.708us      84.708us             1      [[1], [5, 10], [10, 1], [], []]  
               aten::t         9.54%      20.209us        19.17%      40.584us      40.584us             1                            [[1, 10]]  
       aten::transpose         7.04%      14.917us         9.62%      20.375us      20.375us             1                    [[1, 10], [], []]  
           aten::copy_         6.93%      14.667us         6.93%      14.667us      14.667us             1                 [[5, 1], [5, 1], []]  
      aten::as_strided         2.58%       5.458us         2.58%       5.458us       5.458us             1                [[1, 10], [], [], []]  
          aten::expand         0.75%       1.583us         1.12%       2.375us       2.375us             1                        [[1], [], []]  
    aten::resolve_conj         0.51%       1.083us         0.51%       1.083us       1.083us             1                             [[5, 1]]  
      aten::as_strided         0.37%       0.792us         0.37%       0.792us       0.792us             1                    [[1], [], [], []]  
    aten::resolve_conj         0.06%       0.125us         0.06%       0.125us       0.125us             1                            [[10, 1]]  
----------------------  ------------  ------------  ------------  ------------  ------------  ------------  -----------------------------------  

Indeed, instead of torch.mm, the torch.addmm operation is performed.
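As a sanity check (my own sketch) that addmm really is just the mm result plus a broadcast bias, rather than some different computation:

```python
import torch

torch.manual_seed(0)
N, K, M = 5, 10, 1
A = torch.randn(N, K)
W = torch.randn(M, K)
b = torch.randn(M)

# torch.addmm(b, A, W.T) computes b + A @ W.T in one fused call
fused = torch.addmm(b, A, W.t())
# Equivalent: mm, then a broadcast addition of M values over N rows
manual = torch.mm(A, W.t()) + b

print(torch.allclose(fused, manual))  # True
```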

It is unclear to me whether addmm still has the same 2 x M x N x K number of FLOPs or not. Should it be 2 x (M + 1) x N x K, or 2 x (M x N x K + M x N)?

Thanks in advance!


  1. Here we should also make the distinction between FLOPs, which denotes the plural of FLOP, and FLOPS, which signifies the total number of operations per second ↩︎

  2. ↩︎

You are pointing in the right direction.
To begin with, I don’t think there is a “right” answer. People bend the definition towards what’s most relevant for them. You can now see NVIDIA talking about “AI TOPs”, whatever that means.

Now, being rigorous, the most common definition of FLOPS is n_arithmetic_ops / time.
Therefore yes, a bias should contribute to the FLOP count, but it’s often negligible.
As you described, the number of FLOPs for multiplying square N x N matrices is 2N**3. For, say, N = 100 (which is not a very large matrix), we are already talking about 2,000,000 FLOPs, while the bias adds only N**2 = 10,000 more. You hardly care about that difference, and even for N = 10 the situation is similar.
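To see how small the bias contribution is in general, note that for an (N, K) input and M outputs the fraction is (M x N) / (2 x M x N x K + M x N): M and N cancel, leaving 1 / (2K + 1). A quick sketch of that arithmetic:

```python
def bias_fraction(k: int) -> float:
    """Fraction of total FLOPs contributed by the bias:
    (M*N) / (2*M*N*K + M*N) -- M and N cancel, leaving 1 / (2K + 1)."""
    return 1.0 / (2 * k + 1)

for k in (10, 100, 1024):
    print(f"K={k:5d}: bias is {bias_fraction(k):.4%} of total FLOPs")
```

So the bias share depends only on the input feature dimension K, not on the batch size or output width.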

Generally speaking, what you observe between torch.mm and torch.addmm is a difference in implementation. Under the hood, PyTorch may need to perform several operations to give you the right abstraction. For example, if you call an LSTM layer with batch_first=True versus False, you will see a large difference when benchmarking due to the ops that need to occur under the hood.

Also, keep in mind that libraries may execute ops in parallel or apply certain optimizations. Generally speaking, FLOPs need not be proportional to run time. For example, an algorithm with computational cost O(n^2) that is parallelizable can simply run faster than a sequential O(n) one, given enough compute.

FLOPs make sense (and even then it is not a hard rule) when you have two algorithms that operate at the same precision and produce the same result – and still one of them could be parallelizable and just finish faster despite identical FLOPs.

So in short, FLOPs are a bullshit metric unless you compare apples to apples. You will see that many papers do not report FLOPs but training time or steps instead.

If you need proper comparison, there is nothing better than profiling with a fair setup.

The bias just adds M x N FLOPs.

There is only one operation per output element – an addition – and the bias is a vector of size M that gets broadcast over the N rows of the output.
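This is easy to confirm numerically (a quick sketch of my own): the biased result is exactly the unbiased result plus a broadcast add, i.e. M x N extra additions in total:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, K, M = 5, 10, 3
A = torch.randn(N, K)
W = torch.randn(M, K)
b = torch.randn(M)

with_bias = F.linear(A, W, b)   # the addmm path
without = F.linear(A, W)        # the mm path

# The only difference: M bias values broadcast-added over N rows,
# i.e. one addition per element of the (N, M) output.
print(torch.allclose(with_bias, without + b))  # True
print(with_bias.numel())                       # M x N = 15 extra additions
```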

Thanks for your answer!

I agree that when comparing the performance of different methods/theorems for deep learning, it is more instructive to measure training times (per batch, per epoch, etc.).

Thank you for the clarification!