From my understanding, it sounds like training on datasetB is changing the running estimates in your BatchNorm
layers, so that the model then performs poorly on datasetA.
Could you try to calculate the mean and std of both datasets using this approach? It would be interesting to see whether they really are that different.
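
In case it helps, here is a minimal sketch of one way to get per-channel statistics. It is not necessarily identical to the approach referenced above. It assumes your loaders yield `(images, targets)` batches with images shaped `[batch, channels, height, width]`; `channel_stats`, `dataset_a` and `dataset_b` are placeholder names, and the random tensors only stand in for your real data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def channel_stats(loader):
    """Return the per-channel mean and (population) std over a whole loader."""
    n_pixels = 0
    total = None
    total_sq = None
    for images, _ in loader:
        # Accumulate in double precision to limit rounding error.
        images = images.double()
        if total is None:
            channels = images.size(1)
            total = torch.zeros(channels, dtype=torch.double)
            total_sq = torch.zeros(channels, dtype=torch.double)
        # Sum over the batch and spatial dimensions, keeping the channel dim.
        total += images.sum(dim=(0, 2, 3))
        total_sq += (images ** 2).sum(dim=(0, 2, 3))
        n_pixels += images.size(0) * images.size(2) * images.size(3)
    mean = total / n_pixels
    # Var[x] = E[x^2] - E[x]^2; clamp guards against tiny negative values.
    var = (total_sq / n_pixels - mean ** 2).clamp(min=0)
    return mean, var.sqrt()


# Placeholder data: replace these with your actual datasetA and datasetB.
torch.manual_seed(0)
dataset_a = TensorDataset(torch.randn(64, 3, 8, 8), torch.zeros(64))
dataset_b = TensorDataset(torch.randn(64, 3, 8, 8) * 2 + 5, torch.zeros(64))

mean_a, std_a = channel_stats(DataLoader(dataset_a, batch_size=16))
mean_b, std_b = channel_stats(DataLoader(dataset_b, batch_size=16))

print("datasetA mean:", mean_a, "std:", std_a)
print("datasetB mean:", mean_b, "std:", std_b)
```

If the two sets of numbers differ a lot, that would support the idea that the BatchNorm running estimates are being pulled towards datasetB.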