3D CNN models ensemble

Ok, interesting idea.
So as far as I understand your approach, each model uses its own mean and std, which were calculated on the positive samples of the corresponding class. Am I right?

Did this approach outperform 6 different models using a global mean and std?

Either way, you could move the standardization into the Dataset and have it return 6 differently normalized versions of each sample.
That way, you would push some of the computation into the DataLoader, i.e. onto the CPU, while your model ensemble calculates the predictions on the previous batch.
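
Here is a rough sketch of what I mean. I don't know your actual data layout, so the class name `MultiNormDataset`, the tensor shapes, and the mean/std values are just assumptions for illustration; you would plug in the statistics you calculated for each class.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class MultiNormDataset(Dataset):
    """Hypothetical Dataset returning each volume standardized once per model."""

    def __init__(self, volumes, targets, means, stds):
        self.volumes = volumes  # assumed shape: [N, C, D, H, W]
        self.targets = targets  # assumed shape: [N]
        # Reshape the statistics to [M, 1, 1, 1, 1] so they broadcast
        # against a single volume of shape [C, D, H, W].
        self.means = torch.as_tensor(means, dtype=torch.float32).view(-1, 1, 1, 1, 1)
        self.stds = torch.as_tensor(stds, dtype=torch.float32).view(-1, 1, 1, 1, 1)

    def __len__(self):
        return len(self.volumes)

    def __getitem__(self, idx):
        x = self.volumes[idx]  # [C, D, H, W]
        # [1, C, D, H, W] against [M, 1, 1, 1, 1] gives [M, C, D, H, W]
        x_norm = (x.unsqueeze(0) - self.means) / self.stds
        return x_norm, self.targets[idx]


# Dummy data standing in for the real volumes and labels.
torch.manual_seed(0)
volumes = torch.randn(8, 1, 4, 8, 8)
targets = torch.randint(0, 6, (8,))

# Placeholder statistics, one (mean, std) pair per model.
means = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
stds = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]

dataset = MultiNormDataset(volumes, targets, means, stds)
# Set num_workers > 0 to run the normalization in background worker processes.
loader = DataLoader(dataset, batch_size=4, num_workers=0)

batch, labels = next(iter(loader))  # batch: [B, M, C, D, H, W]
# Model i of the ensemble would get inputs_per_model[i].
inputs_per_model = [batch[:, i] for i in range(batch.size(1))]
```

With `num_workers` set to a value larger than 0, the workers would prepare the next normalized batches while the models process the current one. Note that each batch is 6 times larger this way, so the host-to-device transfer also grows; whether it is a net win is something you would have to profile.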

What is the overall accuracy of the model ensemble compared to the first model (~40% accuracy)?