I am trying to use a pretrained ResNet model to perform classification on sequences of data. Each sequence is a mel spectrogram of an audio waveform (speech) over time.
It is important that I preserve the time dimension of the model's output so that I can apply CTC loss to the predictions: CTCLoss — PyTorch 1.11.0 documentation
When I pass the spectrogram through the model, the returned tensor has shape [batch_size, num_classes], but I need it to be [batch_size, input_length, num_classes].
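For reference, `nn.CTCLoss` by default expects per-timestep log-probabilities of shape (T, N, C) — time, batch, classes — along with target and length tensors, which is why a per-timestep output is needed. A minimal sketch with illustrative sizes (T=50 frames, batch of 4, 28 classes, targets of length 10 are my own example values):

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)
T, N, C = 50, 4, 28  # time steps, batch size, num classes (illustrative)
log_probs = torch.randn(T, N, C).log_softmax(2)  # (T, N, C) as CTCLoss expects
targets = torch.randint(1, C, (N, 10))           # label indices, blank=0 excluded
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```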
I currently have a setup like the following:
```python
# 1 channel input, 3 channel output
self.conv1 = nn.Conv2d(1, 3, 2)
backbone = models.resnet50(pretrained=True)
num_filters = backbone.fc.in_features
layers = list(backbone.children())[:-1]
self.feature_extractor = nn.Sequential(*layers)
num_target_classes = 28
self.classifier = nn.Linear(num_filters, num_target_classes)
```