Audio spectrogram data normalization

I have a dataset of audio files and want to normalize the data in spectrogram space. I read each file and apply data augmentation on the fly.

1- If these augmentations change the data statistics, what should I do? How is this handled in vision models? There I see channel-wise mean and std normalization; I think most augmentations don't change those statistics, but masking and some others do. Is that right?
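To illustrate what I mean about masking changing the statistics, here is a small NumPy sketch (the SpecAugment-style `time_mask` helper and the toy data are my own, just for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)

def time_mask(spec, max_width=8, rng=rng):
    """SpecAugment-style time masking: zero out a random block of frames."""
    spec = spec.copy()
    n_frames = spec.shape[1]
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, n_frames - width + 1))
    spec[:, start:start + width] = 0.0
    return spec

# toy log-spectrogram with a clearly nonzero mean: (freq_bins, frames)
spec = rng.normal(loc=5.0, scale=2.0, size=(128, 100))
aug = time_mask(spec)

# zeroing frames pulls the mean toward 0, so augmented stats differ
print(spec.mean(), aug.mean())
```

So if I compute dataset statistics on clean spectrograms but train on masked ones, the inputs the model sees are no longer exactly zero-mean.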

2- Is it okay to normalize all frequency bins and time frames of the spectrogram with the same scale (per-instance, per-batch, or globally)? What are the best practices here?

3- Or should we instead normalize per frequency bin, for example, or per time frame?
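For concreteness, here is a NumPy sketch of the two options I'm asking about in questions 2 and 3, on toy data (the `(batch, freq_bins, frames)` axis layout is my assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
# toy batch of log-spectrograms: (batch, freq_bins, frames)
specs = rng.normal(loc=3.0, scale=1.5, size=(16, 64, 100))

# option A: a single scalar mean/std over the whole (training) set,
# applied identically to every frequency and time position
g_mean, g_std = specs.mean(), specs.std()
global_norm = (specs - g_mean) / g_std

# option B: per-frequency-bin mean/std, shared across batch and time
# (stats have shape (1, 64, 1) and broadcast over the time axis)
f_mean = specs.mean(axis=(0, 2), keepdims=True)
f_std = specs.std(axis=(0, 2), keepdims=True)
per_freq_norm = (specs - f_mean) / f_std
```

Option B would standardize each frequency bin separately (like channel-wise normalization in vision, treating frequency bins as channels), while option A preserves the relative energy differences between bins.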