Hello!
I am trying to use a pretrained ResNet model to perform classification on sequences of data. Each sequence is a mel spectrogram of an audio waveform over time (speech).
It is important that I preserve the time dimension in the model's output so that I can apply CTC loss to the predictions (CTCLoss — PyTorch 1.11.0 documentation).
When I pass the spectrogram through the model, the returned tensor has shape [batch_size, num_classes], but I need it to be [batch_size, input_length, num_classes].
I currently have a setup like the following:
import torch.nn as nn
from torchvision import models

# 1-channel (mel spectrogram) input -> 3 channels, to match the ResNet stem
self.conv1 = nn.Conv2d(1, 3, kernel_size=2)
backbone = models.resnet50(pretrained=True)
num_filters = backbone.fc.in_features  # 2048 for resnet50
# drop the final fully connected layer, keep everything up to the global pool
layers = list(backbone.children())[:-1]
self.feature_extractor = nn.Sequential(*layers)
num_target_classes = 28
self.classifier = nn.Linear(num_filters, num_target_classes)
thanks!