I have many .mp3 audio recordings of people saying the same sentence. These people have different vocal ranges. So my question is:
In the data preprocessing stage, is it necessary to “normalize” the audio samples so they fall into one vocal range? If so, how would I go about doing this in PyTorch? The only relevant thing I could find in the documentation is torchaudio.sox_effects.apply_effects_tensor(), but I don’t know how to use it properly. I am completely new to PyTorch, so any help would be great.
I am also a newbie in PyTorch, and I’ve been facing the same questions.
The best I could find is to apply a sequence of transforms while loading the data with my Dataset class, in this case EgfxDataset. Assuming that in your Dataset __init__ you have included the transforms:

```python
class EgfxDataset(torch.utils.data.Dataset):
    def __init__(self, data_dir, transforms):
        """
        Args:
            data_dir (str): Path to the data directory
            transforms (list): List of transforms to apply to the audio samples

        Returns:
            torch.utils.data.Dataset: Dataset object containing tuples of dry and wet audio samples
        """
        self.data_dir = Path(data_dir) / 'egfxset'
        self.transforms = transforms
```
You can define transforms to reduce the dynamic range, correct the DC offset, and peak-normalize between -1.0 and +1.0:
```python
def contrast(tensor):
    # Broadband compression via sox-style contrast enhancement
    return torchaudio.functional.contrast(tensor, 50)

def correct_dc_offset(tensor):
    # Remove any constant offset by subtracting the mean
    return tensor - torch.mean(tensor)

def peak_normalize(tensor):
    # Scale so the largest absolute sample is 1.0
    return tensor / torch.max(torch.abs(tensor))
```
```python
# Create a list of the transforms to apply:
TRANSFORMS = [contrast, correct_dc_offset, peak_normalize]

# When you instantiate your Dataset:
dataset = EgfxDataset(data_dir=data_dir, transforms=TRANSFORMS)
```
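For completeness, here is a minimal, self-contained sketch of how a Dataset like this might apply the transform list in __getitem__. The ToyAudioDataset class and the in-memory waveform list are hypothetical stand-ins for EgfxDataset’s actual file loading, just to show the chaining:

```python
import torch
from torch.utils.data import Dataset

def correct_dc_offset(tensor):
    # Remove any constant offset by subtracting the mean
    return tensor - torch.mean(tensor)

def peak_normalize(tensor):
    # Scale so the largest absolute sample is 1.0
    return tensor / torch.max(torch.abs(tensor))

class ToyAudioDataset(Dataset):
    """Hypothetical dataset: holds waveforms in memory and
    applies the transform list on access."""
    def __init__(self, waveforms, transforms=None):
        self.waveforms = waveforms
        self.transforms = transforms or []

    def __len__(self):
        return len(self.waveforms)

    def __getitem__(self, idx):
        x = self.waveforms[idx]
        for t in self.transforms:  # apply each transform in order
            x = t(x)
        return x

# Usage: one fake mono clip with a deliberate DC offset
waves = [torch.randn(1, 16000) + 0.1]
ds = ToyAudioDataset(waves, transforms=[correct_dc_offset, peak_normalize])
sample = ds[0]  # DC-corrected, then peak-normalized
```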
If anyone has more tools to suggest or corrections to point out I am listening.
When you load the files with torchaudio.load, the waveform data is already normalized between -1.0 and 1.0.
From there you could apply further normalization, if you wish. Or you could just let the model handle it. In image processing, it is actually ideal to augment the data with various transforms in order to make the model more robust to various levels of tone, contrast, brightness, discoloration, etc.
You could apply a BatchNorm1d on the inputs. For example:

```python
# In your model's __init__:
self.norm = nn.BatchNorm1d(1)

# At the start of forward:
def forward(self, x):
    x = self.norm(x)
```
Additionally, if you have varying sample rates, you may want to consider conditionally adding an upsampler so that all clips share a uniform sample rate.
Thanks a lot for pointing that out!
Adding that line from the
audio_io_tutorial to the documentation would clarify a lot.
BatchNorm1d is very useful, but in some cases you can’t use large batch sizes, and you may still want to reduce the global dynamic range of a dataset. The contrast transform helps a bit, working as a broadband compressor; another solution might be loudness normalization. There are packages that allow loudness normalization on NumPy arrays, but I have not yet had the opportunity to test whether it would significantly increase loading time. If there is a possibility to apply loudness normalization directly on tensors, I would be very interested in trying it.