I am trying to estimate how many FLOPs my model performs, similar to how it is done in this paper: https://arxiv.org/pdf/2001.08361.pdf. The paper gives a transformer-specific estimate of the number of add-multiply operations per training token, C ≈ 6N floating-point operations, where N is the number of parameters excluding the embedding parameters. This number comes from the forward pass requiring ~2N operations and the backward pass roughly twice as many as the forward pass (~4N).
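For concreteness, here is a minimal sketch of how I compute that estimate; the helper names are my own, and following the paper I subtract only `nn.Embedding` parameters from the total:

```python
import torch.nn as nn

def non_embedding_params(model: nn.Module) -> int:
    """Count parameters, excluding nn.Embedding layers (N in Kaplan et al.)."""
    embedding = sum(
        p.numel()
        for m in model.modules() if isinstance(m, nn.Embedding)
        for p in m.parameters(recurse=False)
    )
    total = sum(p.numel() for p in model.parameters())
    return total - embedding

def train_flops_per_token(model: nn.Module) -> int:
    """C ≈ 6N: ~2N for the forward pass plus ~4N for the backward pass."""
    return 6 * non_embedding_params(model)
```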

I am now training a transformer model in which I have frozen all of the network's parameters by setting requires_grad=False. I have then added a small MLP before the transformer's input layer; it processes the input and feeds the result to the transformer. The MLP's parameters are the only ones that will be tuned. My question is: what effect does this have on the number of operations performed during the forward and backward passes?
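The setup I describe looks roughly like this (layer sizes and depths are made up purely for illustration):

```python
import torch
import torch.nn as nn

d_model = 16  # hypothetical model width

# The pretrained transformer, fully frozen.
transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
for p in transformer.parameters():
    p.requires_grad = False

# Small trainable MLP that preprocesses the input before the transformer.
mlp = nn.Sequential(
    nn.Linear(d_model, d_model),
    nn.ReLU(),
    nn.Linear(d_model, d_model),
)

x = torch.randn(2, 5, d_model)       # (batch, sequence, features)
out = transformer(mlp(x))            # only the MLP's parameters get updated
```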

After reading Autograd mechanics — PyTorch 1.10.0 documentation, my understanding is that since I need gradients for the input layer (my MLP), the autograd engine would still need to store all intermediate gradients in the network (including the transformer), since the gradients for the MLP depend on them. Am I understanding this correctly? In that case, would the number of operations be C ≈ 6N with N = N_transformer + N_MLP? On the other hand, if I instead added an MLP after the transformer for fine-tuning and set requires_grad=False on the transformer layers, would we get 2N (N = N_transformer + N_MLP) for the forward pass, but only 2 * (2 * N_MLP) (i.e. twice the operations of the MLP forward pass) for the backward pass?
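The two placements can be checked empirically with a tiny stand-in model (a single `nn.Linear` in place of the transformer, purely for illustration). When the trainable MLP sits upstream, the frozen block's output still requires grad, so backward propagates activation gradients through the frozen block, although no weight gradients are computed for its frozen parameters; when the MLP sits downstream, the frozen block's output does not require grad and backward stops at the MLP's input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

frozen = nn.Linear(8, 8)   # stand-in for the frozen transformer
for p in frozen.parameters():
    p.requires_grad = False
mlp = nn.Linear(8, 8)      # stand-in for the trainable MLP

x = torch.randn(4, 8)

# Case 1: trainable MLP *before* the frozen block. The output is connected
# to mlp's parameters, so backward must traverse the frozen block.
out_before = frozen(mlp(x))
out_before.sum().backward()

# Case 2: trainable MLP *after* the frozen block. The frozen block's output
# is detached from any trainable leaf, so backward stops at the MLP's input.
hidden = frozen(x)
out_after = mlp(hidden)
```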