Convert Torchaudio mel spectrograms for ResNet networks

Hello,
I am building a sound classifier using Torchaudio. The WAV file is loaded and then transformed this way:

mel_spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,
    hop_length=512,
    n_mels=64
)
size=224
transform_spectra = T.Compose([
    mel_spectrogram,
    T.Resize(size),
    T.CenterCrop(size),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225]),
])

I then created a network with ResNet:

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Use a pretrained model
        self.network = models.resnet34(pretrained=True)
        # Replace last layer
        num_ftrs = self.network.fc.in_features
        self.network.fc = nn.Linear(num_ftrs, 2)
    def forward(self, xb):
        return self.network(xb)
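    def freeze(self):
        # Freeze everything, then re-enable gradients for the final FC layer.
        # Note: the attribute is `requires_grad`; `require_grad` is silently ignored.
        for param in self.network.parameters():
            param.requires_grad = False
        for param in self.network.fc.parameters():
            param.requires_grad = True
    def unfreeze(self):
        for param in self.network.parameters():
            param.requires_grad = True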

and just changed the last FC to only have 2 classes.

However I get this error when I run the training.

/opt/conda/lib/python3.8/site-packages/torchvision/transforms/functional.py in to_tensor(pic)
    112     """
    113     if not(F_pil._is_pil_image(pic) or _is_numpy(pic)):
--> 114         raise TypeError('pic should be PIL Image or ndarray. Got {}'.format(type(pic)))
    115 
    116     if _is_numpy(pic) and not _is_numpy_image(pic):

TypeError: pic should be PIL Image or ndarray. Got <class 'torch.Tensor'>

Can anybody help?

P.S. Is there a tutorial on training a ResNet with data converted to mel spectrograms? I would think this is a standard problem in sound classification, but there are not many tutorials out there.

UPDATE:

I modified the transform and removed the ToTensor:

transform_spectra = T.Compose([
    mel_spectrogram,
    T.Resize(size),
    T.CenterCrop(size),
    #T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0.225]),
    T.Normalize((0.5,),(0.5,)),
])

but then I received the following error.

Given groups=1, weight of size [64, 3, 7, 7], expected input[1, 1, 224, 224] to have 3 channels, but got 1 channels instead

I then googled a bit and modified the net, changing the number of input channels as shown below:

class Net(nn.Module):

    def __init__(self):
        super().__init__()
        # Use a pretrained model
        self.network = models.resnet34(pretrained=True)
        # Replace the first conv layer so it accepts 1-channel input
        # (this new layer is randomly initialized, not pretrained)
        self.network.conv1=nn.Conv2d(1, self.network.conv1.out_channels, 
                      kernel_size=self.network.conv1.kernel_size[0], 
                      stride=self.network.conv1.stride[0], 
                      padding=self.network.conv1.padding[0])
        # Replace last layer
        num_ftrs = self.network.fc.in_features
        self.network.fc = nn.Linear(num_ftrs, 2)
    def forward(self, xb):
        return self.network(xb)
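    def freeze(self):
        # Freeze everything, then re-enable gradients for the final FC layer.
        # Note: the attribute is `requires_grad`; `require_grad` is silently ignored.
        for param in self.network.parameters():
            param.requires_grad = False
        for param in self.network.fc.parameters():
            param.requires_grad = True
    def unfreeze(self):
        for param in self.network.parameters():
            param.requires_grad = True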
model_ft = Net() 

Is this the right / normal approach in these circumstances, or will the model work best if the input has 3 channels (color) rather than 1 (grayscale)?
I also freeze the weights before training. I am sure this is correct for the FC layer, but would it not have an effect on the conv1 layer?

Sorry, I am just a novice and have not looked into CNNs for a few years now.

Thanks in advance.

In your current approach you are replacing the pretrained conv1 layer with a randomly initialized nn.Conv2d layer. Afterwards you are freezing its random parameters without any training, so you might want to consider training it too.
Alternatively, you could also try to e.g. repeat the single channel of your input 3 times and pass it to the pretrained conv layer accepting inputs with 3 channels.

Thanks for your reply and sorry for asking.

  1. Is there an easy way to do this? I think I’d rather keep the conv layer untouched to avoid retraining it.
  2. Is there a version of the mel spectrogram in RGB, as it is displayed on screen, or does that not make sense?

I am also now facing another issue with normalizing the input to the ResNet.
I do not want to crop the image, as my algorithm is based on tagging images with the presence of a beep, and cropping might remove beeps that are just entering the image.
Is it OK if I just use the following Compose and pass its output to the ResNet?

Based on the issues above, I assume I will not be able to use what is described in ResNet | PyTorch,
as the spectrogram is not a simple image like the one in that example.

transform_spectra = T.Compose([
    mel_spectrogram,
    T.Resize((size,size)),
    T.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0.225]),
])

  1. Yes, you could use x = x.repeat(1, 3, 1, 1) to repeat the single channel 3 times assuming x was in the shape [batch_size, 1, height, width].
  2. The visualization uses a colormap and maps the single-channel values to it. You could also try to use these colormap-encoded images, but I’m unsure if you would get any benefit from it.
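
A small sketch of the channel-repeat idea, using a random tensor as a stand-in for a batch of resized spectrograms. The helper name `to_three_channels` is hypothetical.

```python
import torch

def to_three_channels(x):
    """Repeat a [batch, 1, H, W] tensor along the channel dim to get [batch, 3, H, W]."""
    if x.dim() != 4 or x.size(1) != 1:
        raise ValueError(f"expected shape [batch, 1, H, W], got {tuple(x.shape)}")
    return x.repeat(1, 3, 1, 1)

x = torch.rand(4, 1, 224, 224)   # stand-in for a batch of mel spectrograms
y = to_three_channels(x)
print(y.shape)
```

Inside a per-sample T.Compose the spectrogram of a mono clip has shape [1, n_mels, time] with no batch dimension, so the equivalent call there would be x.repeat(3, 1, 1), for example wrapped in T.Lambda.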

I’m unsure what the exact issue with the normalization is. Are you concerned about the Resize operation or the stats values?

Hi, I am mostly concerned about not providing the right value ranges for ResNet. It is not clear what the right conversion is from the Torchaudio MelSpectrogram output to the format ResNet expects as input. Most of the examples use a PIL Image and crop it. I have not seen an example that takes the MelSpectrogram output and feeds it into a ResNet.

Thank you.

The posted mean and std values are calculated from the ImageNet training dataset and normalize the input batches to have a zero mean and a unit variance. You could either calculate new stats from your training data or just try to use the normalized input tensors in the range [0, 1].
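
Calculating new stats from your own training data could be sketched as below. The helper name `dataset_stats` is hypothetical, and random tensors stand in for the real training spectrograms; a single mean and std are computed because the spectrograms have one channel.

```python
import torch

def dataset_stats(specs):
    """Return (mean, std) over all values of an iterable of spectrogram tensors."""
    total, total_sq, count = 0.0, 0.0, 0
    for s in specs:
        s = s.double()                       # accumulate in float64 for accuracy
        total += s.sum().item()
        total_sq += (s ** 2).sum().item()
        count += s.numel()
    mean = total / count
    var = max(total_sq / count - mean ** 2, 0.0)
    return mean, var ** 0.5

torch.manual_seed(0)
# Stand-in data: 8 fake spectrograms of shape [1, n_mels, time].
train_specs = [torch.rand(1, 64, 100) * 50 for _ in range(8)]

mean, std = dataset_stats(train_specs)
normalized = [(s - mean) / std for s in train_specs]
```

The resulting values could then be passed as T.Normalize(mean=[mean], std=[std]) for 1-channel input, or repeated three times if the channel is repeated first. Computing them on the training split only avoids leaking information from the validation data.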

OK, let’s see if I understood correctly. I only need to normalize with the published values if my input data is the same as (or similar to) the data used for training the model. Otherwise, as long as my data is normalized to the range [0, 1], then I do not need any further normalization.

Just one more clarification: is the range positive values between 0 and 1, or is 0 the mean and 1 the variance, so that values are roughly between -1 and 1? Does it matter?

Thanks very much in advance.

It depends on your use case, but generally normalizing the data helps in model convergence.
The Normalize transformation subtracts the passed mean and divides by the std so that, if these stats match your data, the samples have a zero mean and unit variance, which often helps even more than a normalization via scaling.
I would recommend experimenting with a few approaches to see which one works for your use case.
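
To make the difference between the two options concrete, here is a small sketch comparing scaling to [0, 1] with standardization to zero mean and unit variance. The helper names are hypothetical, and a random tensor stands in for a real spectrogram.

```python
import torch

def min_max_scale(x, eps=1e-8):
    # Maps values into [0, 1]; the mean and variance are whatever the data gives.
    return (x - x.min()) / (x.max() - x.min() + eps)

def standardize(x, eps=1e-8):
    # Zero mean, unit variance; values are NOT bounded to [-1, 1].
    return (x - x.mean()) / (x.std() + eps)

torch.manual_seed(0)
spec = torch.rand(1, 64, 100) * 50   # stand-in for one mel spectrogram

scaled = min_max_scale(spec)
standardized = standardize(spec)

print("scaled range:", scaled.min().item(), scaled.max().item())
print("standardized mean/std:", standardized.mean().item(), standardized.std().item())
```

Here the stats are taken per sample for brevity; with T.Normalize you would typically pass fixed stats computed once over the training set.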
