When not to use batch normalisation?

I have a task to predict a disease from a person's ECG. A raw ECG is a 2-D matrix (num leads × num seconds). The reading also depends on, and varies slightly with, the machine (or brand) used to capture it: a GE machine may produce a slightly different distribution than a Philips machine, though the difference is small. I have trained a 1-D conv model with residual connections for this. A model trained on GE data shows a 5% performance drop when I predict on Philips ECGs.
Could anyone help me with figuring out whether removing Batch Normalisation during training will help in generalising the model so that I get similar performance on other machine types as well?

What you’re describing sounds like a perfect case for batch normalization. You’ve stated:

  • The ECG reading can be brand/machine dependent;
  • ECG is provided in raw form from the machine without any normalization;

What you should do is train the model with data from a wide range of machines and most definitely include batch norm layers.

The best analogy I can give you is how our eyesight works. When you go from a dark room to a bright room, your eyes “adjust” to the new light settings. Likewise, when you go from a room with yellow lighting to one with blue or full-spectrum lighting, everything looks different until your eyes adjust.

Similarly, batchnorm layers are trained to normalize the incoming data into a range that the convolution layers are most effective in.
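
To make the normalization step concrete, here is a minimal sketch in plain Python of the standardization a batch norm layer applies to each channel (here, each lead), leaving out the learned scale and shift. The readings, the gain of 1.1 and the offset of 0.2 are made-up values for illustration, not measurements from any real machine. It suggests that a constant gain and offset between two machines largely cancels out when the statistics are computed from the data being normalized.

```python
import math

def normalize_lead(samples, eps=1e-5):
    """Standardize one lead's samples to zero mean and unit variance."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return [(x - mean) / math.sqrt(var + eps) for x in samples]

# Hypothetical readings of one lead from a reference machine.
reference = [0.0, 0.1, 1.2, -0.3, 0.05, 0.0, 0.1, 1.1]

# The same signal as a second machine might record it
# (assumed gain of 1.1 and offset of 0.2, chosen only for illustration).
shifted = [1.1 * x + 0.2 for x in reference]

norm_reference = normalize_lead(reference)
norm_shifted = normalize_lead(shifted)

# Largest difference between the two normalized signals.
max_gap = max(abs(a - b) for a, b in zip(norm_reference, norm_shifted))
print(max_gap)
```

One caveat: at inference time, standard batch norm implementations use running statistics collected during training, so this cancellation only applies if those statistics reflect the machine being tested.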

During training I have data from only one machine, but I want to improve performance on other machines.
A doctor shown an ECG from any machine is able to detect the problem, because he can adjust his reading to the machine. The model, by contrast, shows a performance drop.
Some machines produce slightly higher peaks and lower troughs. Some machines have a bias in their readings. These are the only differences that occur.
Do you still think I should use batch norm to train?

If you’re only training on data from one machine, then batch norm might not make a difference. So I’m not really clear on what your question is, then.

The doctors who make those adjustments learn to do so through trial and error. It’s a learning process. A model at inference time, however, is not going to engage in “trial and error” unless you specifically add a trainable layer and program it to fine-tune after a certain number of data samples have accrued.
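
One possible reading of that idea, sketched in plain Python: a normalizer that starts with the training machine's statistics and switches to statistics estimated from the new machine once enough samples have accrued. Note that this swaps gradient-based fine-tuning for a simple running mean and variance (Welford's algorithm). The class name `AdaptiveNormalizer`, its parameters and the sample values are all hypothetical, so treat it as an illustration rather than a recommended implementation.

```python
import math

class AdaptiveNormalizer:
    """Hypothetical per-lead normalizer that adapts to a new machine.

    Uses the training machine's statistics until `min_samples` values
    from the new machine have been seen, then uses the running estimates.
    """

    def __init__(self, train_mean, train_std, min_samples=1000):
        self.train_mean = train_mean
        self.train_std = train_std
        self.min_samples = min_samples
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def update(self, samples):
        """Fold new samples into the running mean and variance."""
        for x in samples:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    def stats(self):
        """Return the (mean, std) currently used for normalization."""
        if self.count < self.min_samples:
            return self.train_mean, self.train_std
        return self.mean, math.sqrt(self.m2 / self.count)

    def __call__(self, samples):
        mean, std = self.stats()
        return [(x - mean) / (std + 1e-8) for x in samples]

# Usage with made-up values: four samples are enough to trigger the switch.
normalizer = AdaptiveNormalizer(train_mean=0.0, train_std=1.0, min_samples=4)
before = normalizer.stats()           # still the training statistics
normalizer.update([0.2, 1.3, 0.2, 1.3])
after = normalizer.stats()            # now estimated from the new samples
```

Whether adapting statistics like this actually recovers the lost 5% would need to be checked on held-out recordings from the second machine.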