I have many .mp3 audio recordings of people saying the same sentence. These people have different vocal ranges. So my question is:
In the data preprocessing stage, is it necessary to “normalize” the audio samples to be in one vocal range? If so, how would I go about doing this in PyTorch? I couldn’t find anything relevant in the documentation other than the method torchaudio.sox_effects.apply_effects_tensor(). But I don’t know how to use it properly. I am completely new to PyTorch. Any help would be great.
Hi,
I am also a newbie in PyTorch, and I've been facing the same question.
The best I could find is to apply a sequence of transforms while loading the data with my Dataset class, in this case EgfxDataset. Assuming that in your Dataset __init__ you have included the transforms:
from pathlib import Path
from torch.utils.data import Dataset

class EgfxDataset(Dataset):
    """Egfx dataset

    Args:
        data_dir (str): Path to the data directory
        transforms (list): List of transforms to apply to the audio samples

    Returns:
        torch.utils.data.Dataset: Dataset object containing tuples of dry and wet audio samples
    """
    def __init__(self,
                 data_dir,
                 transforms=None):
        self.data_dir = Path(data_dir) / 'egfxset'
        self.transforms = transforms
You can define the transforms to reduce dynamic range, correct DC offset, and peak-normalize to between -1.0 and +1.0:
import torch
import torchaudio

def contrast(tensor):
    return torchaudio.functional.contrast(tensor, 50)

def correct_dc_offset(tensor):
    return tensor - torch.mean(tensor)

def peak_normalize(tensor):
    return tensor / torch.max(torch.abs(tensor))
# I create a list of the transforms I want to apply:
TRANSFORMS = [contrast, correct_dc_offset, peak_normalize]

# When you instantiate your Dataset:
dataset = EgfxDataset(data_dir=data_dir, transforms=TRANSFORMS)
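The snippet above stores the transforms but doesn't show where they get applied. A minimal sketch of a __getitem__ that chains them might look like the following; the dummy random waveform stands in for actual file loading (torchaudio.load on paths under data_dir), which the real class would handle:

```python
import torch
from pathlib import Path
from torch.utils.data import Dataset

class EgfxDataset(Dataset):
    def __init__(self, data_dir, transforms=None):
        self.data_dir = Path(data_dir) / 'egfxset'
        self.transforms = transforms or []

    def __len__(self):
        return 1  # placeholder; the real class would count audio files

    def __getitem__(self, idx):
        tensor = torch.randn(1, 48000)  # stand-in for torchaudio.load(...)
        for t in self.transforms:       # apply each transform in order
            tensor = t(tensor)
        return tensor

def correct_dc_offset(tensor):
    return tensor - torch.mean(tensor)

def peak_normalize(tensor):
    return tensor / torch.max(torch.abs(tensor))

ds = EgfxDataset('.', transforms=[correct_dc_offset, peak_normalize])
sample = ds[0]  # DC-free, peaks at +/-1.0
```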
If anyone has more tools to suggest or corrections to point out I am listening.
Thanks
From there you could apply further normalization, if you wish. Or you could just let the model handle it. In image processing, it is actually ideal to augment the data with various transforms in order to make the model more robust to various levels of tone, contrast, brightness, discoloration, etc.
You could apply a BatchNorm1d on the inputs. For example:
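A minimal sketch, assuming mono clips batched with shape (batch, channels, time), which is what BatchNorm1d expects:

```python
import torch
import torch.nn as nn

# One feature channel for mono audio; in training mode this normalizes
# each channel to zero mean and unit variance across the batch.
norm = nn.BatchNorm1d(num_features=1)

batch = torch.randn(8, 1, 16000)  # 8 one-second mono clips at 16 kHz
out = norm(batch)                 # same shape, normalized over the batch
```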
Thanks a lot for pointing that out!
Adding that line from the audio_io_tutorial to the documentation would clarify a lot.
BatchNorm1d is very useful, but in some cases you can't use large batch sizes and you want to reduce the global dynamic range of a dataset. The contrast transform helps a bit, working as a broadband compressor; another solution might be loudness normalization. There are packages that allow loudness normalization on NumPy arrays, but I have not yet had the opportunity to test whether it would significantly increase loading time. If there is a possibility to apply loudness normalization directly on tensors, I would be very interested in trying it.