Mean and std of merged dataset

sakh251 · March 26, 2022, 10:08pm

Hello,

I have a model trained on a merged dataset. I know the mean and std of each of the source datasets. Now I want to know how I should normalize new data for this model.
For the mean, I think we can calculate the average of the means?
But for the std, since it could depend on their covariance, I don’t have any idea. The datasets are two different datasets of labeled images.

Thank you.

Andrei_Cristea · March 26, 2022, 10:42pm

Could you normalize the two datasets first, and then merge the normalized datasets?

tom · March 27, 2022, 4:33am

Yes, you can calculate this.

If the relative sizes of the datasets are w_1 and w_2 (i.e. w_1 + w_2 == 1), the means are m_1, m_2, and the stds are s_1, s_2 (with a tiny bit of inaccuracy if they are unbiased stds; that should not matter if you just merge two datasets, though), you have

mean: m = w_1 m_1 + w_2 m_2

std: s = (w_1 (s_1**2 + m_1**2) + w_2 (s_2**2 + m_2**2) - m**2)**0.5

(This computes the uncentered second moments from the stds, takes the weighted average to get that of the entire set, and then computes the std of the entire set from that.)
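In case it helps, here is a minimal sketch of the two formulas in plain Python, checked against the mean and std computed directly on the concatenated data. The function name `merged_mean_std`, the sample counts `n1`, `n2`, and the synthetic data are made up for illustration, and the sketch assumes the per-dataset stds are population (biased) stds.

```python
import math
import random
import statistics


def merged_mean_std(n1, m1, s1, n2, m2, s2):
    """Combine per-dataset mean/std into the mean/std of the merged dataset.

    Assumption: s1 and s2 are population (biased) standard deviations.
    """
    # Relative sizes, so that w1 + w2 == 1.
    w1 = n1 / (n1 + n2)
    w2 = n2 / (n1 + n2)
    # Merged mean is the size-weighted average of the means.
    m = w1 * m1 + w2 * m2
    # Uncentered second moment of each set is s**2 + m**2; average them by weight.
    second_moment = w1 * (s1**2 + m1**2) + w2 * (s2**2 + m2**2)
    # Std of the merged set from its second moment and mean.
    s = math.sqrt(second_moment - m**2)
    return m, s


# Synthetic stand-ins for the two datasets (illustrative values only).
random.seed(0)
a = [random.gauss(0.3, 0.1) for _ in range(1000)]
b = [random.gauss(0.7, 0.25) for _ in range(3000)]

m, s = merged_mean_std(
    len(a), statistics.fmean(a), statistics.pstdev(a),
    len(b), statistics.fmean(b), statistics.pstdev(b),
)

# Reference values computed directly on the concatenated data.
m_ref = statistics.fmean(a + b)
s_ref = statistics.pstdev(a + b)
print(m, m_ref)
print(s, s_ref)
```

For images you would apply the same thing per channel, using the number of pixels (or images, if they all have the same size) in each dataset to get the weights.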

Best regards

Thomas

sakh251 · March 27, 2022, 6:32pm

@tom thank you very much Thomas
I am a bit confused about how you computed the merged std. What do you mean by s_12 or m_22?
Are these joint std and mean of set 1 and 2 ? Or set 2 with itself?

Best Regards

tom · March 27, 2022, 7:16pm

I had the formatting wrong. I have fixed it now, so hopefully it's clearer.

Best regards

Thomas