Possible Issue with batch norm train/eval modes

From my understanding, it sounds like datasetB is changing the running estimates (`running_mean` and `running_var`) in your BatchNorm layers, so that datasetA performs poorly afterwards.
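To illustrate the effect, here is a minimal sketch with a single `nn.BatchNorm1d` layer. The tensors `dataset_a` and `dataset_b` are hypothetical stand-ins for your two datasets (I don't know your real statistics), chosen so that their means and stds differ a lot:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(3)  # running_mean starts at 0, running_var at 1

# Hypothetical stand-ins for the two datasets (assumed statistics)
dataset_a = torch.randn(256, 3)               # roughly mean 0, std 1
dataset_b = torch.randn(256, 3) * 4.0 + 10.0  # roughly mean 10, std 4

# In train mode every forward pass updates the running estimates
bn.train()
with torch.no_grad():
    for _ in range(50):
        bn(dataset_a)
    stats_after_a = bn.running_mean.clone()

    for _ in range(50):
        bn(dataset_b)
    stats_after_b = bn.running_mean.clone()

# In eval mode the running estimates are used but not updated
bn.eval()
with torch.no_grad():
    before = bn.running_mean.clone()
    bn(dataset_a)
unchanged_in_eval = torch.equal(before, bn.running_mean)

print("running_mean after datasetA:", stats_after_a)
print("running_mean after datasetB:", stats_after_b)
print("unchanged in eval mode:", unchanged_in_eval)
```

After the second loop the running mean has moved towards the statistics of `dataset_b`, so in `eval()` mode samples from `dataset_a` would be normalized with estimates that no longer fit them.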

Could you try to calculate the mean and std of both datasets using this approach? It would be interesting to see whether they really are that different.
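In case it helps, this is one possible way to compute the per-channel statistics. It is only a sketch and not necessarily identical to the approach mentioned above. It assumes your `DataLoader` yields image batches of shape `[N, C, H, W]` as the first element; `channel_mean_std`, `images_a`, and `loader_a` are made-up names, and the random tensor just stands in for your real data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def channel_mean_std(loader):
    """Per-channel mean and (biased) std over all batches of a loader.

    Assumes each batch is a tuple/list whose first element has shape [N, C, H, W].
    """
    count = 0
    total = None
    total_sq = None
    for images, *_ in loader:
        images = images.double()  # accumulate in float64 for stability
        if total is None:
            total = torch.zeros(images.size(1), dtype=torch.double)
            total_sq = torch.zeros(images.size(1), dtype=torch.double)
        count += images.numel() // images.size(1)  # values per channel
        total += images.sum(dim=(0, 2, 3))
        total_sq += (images ** 2).sum(dim=(0, 2, 3))
    mean = total / count
    # Var[x] = E[x^2] - E[x]^2; clamp guards against tiny negative values
    std = (total_sq / count - mean ** 2).clamp(min=0).sqrt()
    return mean, std


# Hypothetical data standing in for datasetA
torch.manual_seed(0)
images_a = torch.randn(100, 3, 8, 8) * 2.0 + 5.0
loader_a = DataLoader(TensorDataset(images_a), batch_size=16)

mean_a, std_a = channel_mean_std(loader_a)
print("mean:", mean_a)
print("std:", std_a)
```

Running the same function on the loaders for datasetA and datasetB and comparing the two results should show how far apart the datasets are.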