I am trying to train a diffusion model on a specialized dataset (quite different from ordinary images). What concerns me is that when papers compare/evaluate models (e.g. the original Stable Diffusion paper), they report FID, but in the training code (the validation_step, if Lightning is used), the validation metric is the MSE between the predicted noise and the true noise on the validation set, and that is what decides which weights and biases get saved as the checkpoint.
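To make the setup concrete, here is a minimal sketch of the checkpointing logic I mean, with NumPy standing in for the model (the `noise_mse` helper and the toy predictions are hypothetical, just mimicking Lightning's `ModelCheckpoint(monitor="val_loss", mode="min")` behavior):

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_mse(eps_true, eps_pred):
    # Epsilon-prediction objective evaluated on the validation set.
    return float(np.mean((eps_true - eps_pred) ** 2))

best = float("inf")
for epoch in range(3):
    eps_true = rng.standard_normal((8, 4))  # noise actually added to the data
    # Hypothetical model output that improves over epochs:
    eps_pred = eps_true + 0.1 * (3 - epoch) * rng.standard_normal((8, 4))
    val_loss = noise_mse(eps_true, eps_pred)
    if val_loss < best:
        best = val_loss
        # In real code: torch.save(model.state_dict(), checkpoint_path)
```

So the checkpoint that survives training is the one with the lowest noise-prediction MSE, never the one with the best FID.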
I know that computing FID during training is expensive, since it requires running inference on a large number of samples. But does saving checkpoints according to the noise MSE actually align with FID? If not, are we saving the wrong weights and biases? And if it does align, why don't papers simply compare/evaluate models with the noise MSE instead of FID?
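For reference, FID is the Fréchet distance between Gaussians fitted to Inception features of real and generated images: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)). A tiny sketch of that formula in the 1-D case (toy scalars, not real feature statistics) shows what it measures, which is distribution match rather than per-sample noise error:

```python
import math

def fid_1d(mu1, var1, mu2, var2):
    # Frechet distance between two 1-D Gaussians; image-space FID applies
    # the same formula to the mean/covariance of Inception features.
    return (mu1 - mu2) ** 2 + var1 + var2 - 2 * math.sqrt(var1 * var2)

print(fid_1d(0.0, 1.0, 0.0, 1.0))  # identical distributions -> 0.0
```

This is part of why the two metrics can disagree: a model can have low validation noise MSE yet produce samples whose distribution is far from the data.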