Training time difference in transfer learning

Let’s say I is a domain of images, f : I -> E, and g : E -> F, where E and F are embedding spaces.

There are two cases:

  1. freeze f
  2. unfreeze f

My question: when training g(f(I)), I expected case 1 to train much faster than case 2. But in my experience the ratio is only about 1:1.5, even though f has far more parameters (~100x) than g. I want to save time when f is frozen. Any suggestions?

If there’s a way to save f(A) for all A in I, so that I only need to forward/backward propagate through g, please let me know. Thanks in advance.
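One common reason a frozen f barely speeds things up is that its forward pass still builds an autograd graph, so backward still traverses it. Assuming PyTorch (the models `f` and `g` below are tiny hypothetical stand-ins, not your actual networks), a sketch of freezing f properly:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for f and g; substitute your real models
f = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
g = nn.Sequential(nn.Flatten(), nn.Linear(8 * 16 * 16, 4))

# Freeze f: disable its gradients and put it in eval mode
for p in f.parameters():
    p.requires_grad_(False)
f.eval()

opt = torch.optim.SGD(g.parameters(), lr=0.01)
x = torch.randn(2, 3, 16, 16)
target = torch.randint(0, 4, (2,))

with torch.no_grad():        # no activations are saved for f
    emb = f(x)
loss = nn.functional.cross_entropy(g(emb), target)
opt.zero_grad()
loss.backward()              # backward only traverses g
opt.step()
```

The `torch.no_grad()` context matters as much as `requires_grad_(False)`: without it, f’s activations are still stored for backward, which costs both memory and time.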

I don’t think parameter count is a good metric for how fast a model should run. You can benchmark both cases independently and see whether the result is what you expected.

Take a batch and run it through f (time it a few times).
Do the same for g, and you will get a rough estimate of the time taken for the forward and backward passes.
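The timing procedure above could be sketched like this (a minimal CPU-only sketch; the models are hypothetical placeholders, and on a GPU you would also need `torch.cuda.synchronize()` around the timers):

```python
import time
import torch
import torch.nn as nn

# Hypothetical stand-ins for f and g; substitute your real models
f = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
g = nn.Sequential(nn.Flatten(), nn.Linear(16 * 32 * 32, 10))

def time_forward(model, x, n=10):
    # One warm-up run, then average n timed runs
    with torch.no_grad():
        model(x)
        t0 = time.perf_counter()
        for _ in range(n):
            model(x)
    return (time.perf_counter() - t0) / n

x = torch.randn(8, 3, 32, 32)
tf = time_forward(f, x)
tg = time_forward(g, f(x))
print(f"f forward: {tf * 1e3:.2f} ms, g forward: {tg * 1e3:.2f} ms")
```

Comparing `tf` to `tg` directly tells you how much of each step is spent in f, which is a far better guide than parameter counts.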

You can store the results as files and then have the dataloader load these instead of the images when training g.
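One way to wire that up, assuming PyTorch and a hypothetical one-file-per-embedding layout on disk:

```python
import torch
from torch.utils.data import Dataset

class CachedFeatureDataset(Dataset):
    """Loads precomputed f(A) tensors, one saved file per image."""
    def __init__(self, paths, labels):
        self.paths = paths      # list of file paths to saved embeddings
        self.labels = labels    # matching list of labels

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        # Embeddings were saved with torch.save, so torch.load restores them
        return torch.load(self.paths[i]), self.labels[i]

# Precompute once (sketch; image_dataset and f are assumed to exist):
# with torch.no_grad():
#     for i, (img, y) in enumerate(image_dataset):
#         torch.save(f(img.unsqueeze(0)).squeeze(0), f"cache/{i}.pt")
```

After the one-time precomputation, training g never touches f again, so each epoch only pays for g’s forward and backward passes plus disk I/O.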

Hi Kushaj. Thanks for looking into my question.

Usually the output of f(A) is really big, e.g. (number of images) x channels x height x width = 50000 x 256 x 128 x 128, so saving all of these is not a good idea.
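For concreteness, a quick back-of-the-envelope check of what storing those embeddings in float32 would cost:

```python
# Storage for 50000 x 256 x 128 x 128 embeddings in float32
n_images, c, h, w = 50000, 256, 128, 128
bytes_total = n_images * c * h * w * 4   # 4 bytes per float32 element
print(f"{bytes_total / 1e9:.0f} GB")     # roughly 839 GB
```

Saving in float16 would halve that, but it would still be hundreds of gigabytes, which supports the point that caching the raw embeddings is impractical at this resolution.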

I agree that parameter count is not a good metric, but the training time is what matters here.