Hello,

I am building a sound classifier using torchaudio. The WAV file is loaded and then transformed this way:

```
mel_spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,
    hop_length=512,
    n_mels=64
)

size = 224
transform_spectra = T.Compose([
    mel_spectrogram,
    T.Resize(size),
    T.CenterCrop(size),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
```

I then created a network wrapping a pretrained ResNet:

```
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Use a pretrained model
        self.network = models.resnet34(pretrained=True)
        # Replace the last layer
        num_ftrs = self.network.fc.in_features
        self.network.fc = nn.Linear(num_ftrs, 2)

    def forward(self, xb):
        return self.network(xb)

    def freeze(self):
        # Note: the attribute is requires_grad (with an s)
        for param in self.network.parameters():
            param.requires_grad = False
        for param in self.network.fc.parameters():
            param.requires_grad = True

    def unfreeze(self):
        for param in self.network.parameters():
            param.requires_grad = True
```

and just changed the last FC layer to output only 2 classes.

However, I get this error when I run the training:

```
/opt/conda/lib/python3.8/site-packages/torchvision/transforms/functional.py in to_tensor(pic)
112 """
113 if not(F_pil._is_pil_image(pic) or _is_numpy(pic)):
--> 114 raise TypeError('pic should be PIL Image or ndarray. Got {}'.format(type(pic)))
115
116 if _is_numpy(pic) and not _is_numpy_image(pic):
TypeError: pic should be PIL Image or ndarray. Got <class 'torch.Tensor'>
```

Can anybody help?

P.S. Is there a tutorial for training a ResNet on data converted to mel spectrograms? I would think this is a standard problem for sound classification, but there are not many tutorials out there.

UPDATE:

I modified the transform and removed the ToTensor:

```
transform_spectra = T.Compose([
    mel_spectrogram,
    T.Resize(size),
    T.CenterCrop(size),
    # T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    T.Normalize((0.5,), (0.5,)),
])
```

but then I received the following error:

```
Given groups=1, weight of size [64, 3, 7, 7], expected input[1, 1, 224, 224] to have 3 channels, but got 1 channels instead
```
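As I understand it, the mismatch is about channels: `conv1` of the pretrained ResNet expects a 3-channel (RGB) input, while the spectrogram has only 1. One common workaround I have seen (sketched here, not necessarily the best option) is to repeat the single channel three times so the unmodified network accepts it:

```python
import torch

# A 1-channel spectrogram "image" like the one the pipeline produces
spec = torch.randn(1, 224, 224)

# Repeat the channel dimension: (1, H, W) -> (3, H, W)
spec_3ch = spec.repeat(3, 1, 1)
print(spec_3ch.shape)  # torch.Size([3, 224, 224])
```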

I then googled a bit, modified the net, and changed the number of input channels as below:

```
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Use a pretrained model
        self.network = models.resnet34(pretrained=True)
        # Accept 1-channel input instead of 3-channel RGB
        # (bias=False matches the original conv1)
        self.network.conv1 = nn.Conv2d(1, self.network.conv1.out_channels,
                                       kernel_size=self.network.conv1.kernel_size,
                                       stride=self.network.conv1.stride,
                                       padding=self.network.conv1.padding,
                                       bias=False)
        # Replace the last layer
        num_ftrs = self.network.fc.in_features
        self.network.fc = nn.Linear(num_ftrs, 2)

    def forward(self, xb):
        return self.network(xb)

    def freeze(self):
        for param in self.network.parameters():
            param.requires_grad = False
        for param in self.network.fc.parameters():
            param.requires_grad = True

    def unfreeze(self):
        for param in self.network.parameters():
            param.requires_grad = True

model_ft = Net()
```

Is this right/normal in these circumstances? Or will the model work best if the input has 3 channels (colour) rather than grayscale?

I also freeze the weights before training. I am sure this is correct for the FC layer, but would this not also affect the new conv1 layer?

Sorry, I am just a novice and have not looked into CNNs in a few years.

Thanks in advance.