Comparing two DL models across various datasets

I’m testing two DL models, M1 and M2, on four datasets: D1, D2, D3, and D4.

On D1 and D4, M2 performs better than M1, but on D2 and D3 it shows no improvement.

  • Is this normal?
  • Or do I have to change my architecture so that M2 always outperforms M1 on all datasets?