How do you measure the degree of influence of pre-training on fine-tuning?


I have a randomly initialized (deep learning) model A, and a pre-trained model Ap.
The two are trained (or fine-tuned, more specifically for Ap) on the same target task* and I call them A’ and Ap’, respectively.

p(.) is the performance of a model on the target task. Let’s assume that, in general, p(Ap’) > p(A’).

*Assuming either that everything else is identical (seeds, etc.) or that I have enough seeds to rely on the average behavior.

What I am looking for

A measure/metric of the robustness/dependency of my method with respect to the pre-training; i.e. something that tells me the degree of dependency of my method on the pre-training process.

In the end, I want to repeat this process over many fine-tuning methods and compare them.

Edit 1

I would like to reach a conclusion that looks like: "fine-tuning method X is more robust to pre-training than fine-tuning method Y, based on the fact that even with a worse pre-training it retains more performance."

A bit more formally:

  • say that the training (or, again, fine-tuning) methods X and Y are being compared;
  • their respective performances after they are applied to the (raw, pre-trained) models are (p(Ax), p(Apx)) and (p(Ay), p(Apy)).

I’d like to draw conclusions, based on comparisons between (p(Ax), p(Apx)) and (p(Ay), p(Apy)), about whether X is more or less robust to the pre-training than Y.

Ideally… given a method X, I’d like to generate a sequence (p(Ax), p(Ap1x), ..., p(ApNx)) such that the pre-trained models Ap1, ..., ApN are increasingly better on their original (surrogate) task, with A (without pre-training) as a reference. The sequence of performances on the target task (p(Ax), p(Ap1x), ..., p(ApNx)) would then allow one to draw conclusions about X [compared to another fine-tuning method Y].

A practical (optional) constraint: for the sake of reproducibility, I’d like to use publicly available pre-trained models, so in a perfect scenario all the models Ap1, ..., ApN should be available or “easily” reproducible (i.e. can be programmatically generated).


In the ideal analysis, you would

  • repeatedly randomly initialize your model,
  • do the pretraining on each,
  • train both the “raw” randomly initialized and the pre-trained model on your data and get performance metrics.

This would give you a random sample of the performance metrics with and without pretraining, so you can take the expectation and compute a bootstrapped confidence interval. (It might be interesting to also vary the train/val split, but maybe this is optional.)
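As a rough illustration of the bootstrap step, here is a minimal sketch using only the Python standard library. It assumes you already have one target-task score per independent run for each condition; the function name `bootstrap_gain_ci`, its parameters, and the scores in the example are all made up for illustration, and the percentile bootstrap is just one of several possible interval constructions.

```python
import random
import statistics


def bootstrap_gain_ci(raw_scores, pretrained_scores,
                      n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(pretrained) - mean(raw).

    Each list holds one target-task score per independent run.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each condition with replacement, keeping its size.
        raw = rng.choices(raw_scores, k=len(raw_scores))
        pre = rng.choices(pretrained_scores, k=len(pretrained_scores))
        diffs.append(statistics.fmean(pre) - statistics.fmean(raw))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = statistics.fmean(pretrained_scores) - statistics.fmean(raw_scores)
    return point, lower, upper


if __name__ == "__main__":
    # Illustrative numbers only, not real results.
    raw = [0.71, 0.69, 0.73, 0.70, 0.72]
    pre = [0.80, 0.82, 0.79, 0.81, 0.83]
    print(bootstrap_gain_ci(raw, pre))
```

Running this once per fine-tuning method gives a gain estimate with an interval for each, which is one way to compare how much each method depends on the pretraining.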

This costs quite a bit of computational power. The good news is that if your multiple fine-tuning methods start from the same type of pretrained model, you can amortize the pre-training cost over your experiments.
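The amortization could look roughly like the sketch below: pretrain once per seed and reuse that checkpoint for every fine-tuning method. Everything here is a placeholder (`run_grid`, `init_model`, `pretrain`, `evaluate` and the `methods` mapping are hypothetical callables you would supply yourself), so take it as an outline of the loop structure rather than a real training script.

```python
import copy


def run_grid(seeds, methods, init_model, pretrain, evaluate):
    """Collect target-task scores for every method, with and without pretraining.

    Hypothetical callables supplied by the caller:
      init_model(seed) -> model
      pretrain(model)  -> pretrained model (the expensive step)
      methods          -> dict mapping a name to a callable(model) -> trained model
      evaluate(model)  -> float score on the target task
    """
    results = {name: {"raw": [], "pretrained": []} for name in methods}
    for seed in seeds:
        raw = init_model(seed)
        # Pretraining runs once per seed, not once per method.
        pre = pretrain(copy.deepcopy(raw))
        for name, method in methods.items():
            # Deep copies keep one method's training from leaking into another's.
            results[name]["raw"].append(evaluate(method(copy.deepcopy(raw))))
            results[name]["pretrained"].append(evaluate(method(copy.deepcopy(pre))))
    return results
```

With S seeds and M methods this needs S pretraining runs instead of S × M, at the cost of keeping one pretrained checkpoint per seed around.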

Best regards


Thank you for your reply @tom.
I added a few details to the question after you wrote.

I first thought of something close to what you said, but I’m concerned about the “A practical (optional) constraint” part (above).

Are you aware of where one could find multiple pre-trained weights which were trained with the same procedure but different seeds?

I am not sure there are any. You could check whether @rwightman did something like this in the timm context; he is probably one of the people training the most models.
Both Ross (in timm) and TorchVision publish their training methods, so if you can afford the compute, you can make your own.

Best regards