Possible issue with batch norm train/eval modes

Let me try to summarize the issue to check if I've understood the problem correctly.
You are pretraining on the synthetic datasetA and, when evaluating on both datasets, achieve very good accuracy on A (95%) but only around 50% on B.

Then you are using your train method to finetune the model on this task. While the model still achieves good performance on datasetA in train mode, the accuracy drops significantly to 7% once you switch the model to eval mode. This repeats in every epoch. Is this a correct understanding?
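
For context, batch norm layers normalize the activations with the current batch statistics in train mode, but with the running statistics (estimated during training) in eval mode. If datasetB has a different distribution than the data the running stats were estimated on, this alone can explain the train/eval gap. A minimal sketch of the effect (the feature size and the shift values are made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)

# Estimate the running stats on "datasetA"-like activations (mean ~0, std ~1).
bn.train()
with torch.no_grad():
    for _ in range(100):
        bn(torch.randn(64, 3))

# A "datasetB"-like batch from a shifted distribution (mean ~5, std ~2).
x = torch.randn(64, 3) * 2 + 5

with torch.no_grad():
    bn.eval()
    out_eval = bn(x)   # normalized with the running stats from the A-like data
    bn.train()
    out_train = bn(x)  # normalized with this batch's own statistics

print(out_train.mean().item(), out_train.std().item())  # close to 0 and 1
print(out_eval.mean().item(), out_eval.std().item())    # far off, roughly 5 and 2
```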

Are you shuffling both datasets? How large are the batch sizes during training for both datasets?
Have you tried setting the model to eval mode before passing datasetB to the model inside train?
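If not, something along these lines might be worth a try (the model, loaders, and hyperparameters below are just placeholders for illustration):

```python
import torch
import torch.nn as nn

# Placeholder model and loaders standing in for your real setup.
model = nn.Sequential(nn.Linear(10, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

loader_a = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(4)]
loader_b = [(torch.randn(32, 10) * 2 + 5, torch.randint(0, 2, (32,))) for _ in range(4)]

for epoch in range(2):
    # datasetA batches update the batch norm running stats as usual.
    model.train()
    for data, target in loader_a:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()

    # datasetB batches are passed in eval mode, so the running stats
    # (estimated on datasetA) are used and not updated by datasetB.
    model.eval()
    for data, target in loader_b:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()  # gradients still flow in eval mode
        optimizer.step()
    model.train()
```

Note that eval mode would also disable dropout layers, so keep that in mind if your model uses them.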
Also, have you checked the statistics of both datasets, i.e. are they preprocessed in the same manner? Are the mean and std values completely different between the two datasets?
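Comparing them is quick; a rough sketch (the placeholder loaders stand in for your real DataLoaders):

```python
import torch

def dataset_stats(loader):
    # Rough global mean/std over all samples yielded by the loader.
    samples = torch.cat([data for data, _ in loader], dim=0)
    return samples.mean().item(), samples.std().item()

# Placeholder loaders standing in for your real DataLoaders.
loader_a = [(torch.randn(32, 10), torch.zeros(32)) for _ in range(4)]
loader_b = [(torch.randn(32, 10) * 2 + 5, torch.zeros(32)) for _ in range(4)]

print("datasetA mean/std:", dataset_stats(loader_a))
print("datasetB mean/std:", dataset_stats(loader_b))
```

If the values differ a lot, normalizing both datasets in the same way would be the first thing to check.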