ResNet transfer learning on sequences of data


I am trying to use a pretrained ResNet model to perform classification on sequences of data. Each sequence is a mel spectrogram of an audio waveform (speech) over time.

It is important that I preserve the sequence dimension in the output of the model so that I can apply CTC loss to the predictions (see the CTCLoss page in the PyTorch 1.11.0 documentation).

When I pass the spectrogram through the model, the returned tensor has shape [batch_size, num_classes], but I need it to have shape [batch_size, input_length, num_classes].
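For reference, here is a minimal sketch of the shapes involved on the loss side. The sizes and the random stand-in tensor are illustrative assumptions, not values from the actual model. Note that nn.CTCLoss takes log-probabilities in [input_length, batch_size, num_classes] order, so a [batch_size, input_length, num_classes] output has to be permuted before the loss is applied.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumptions for this sketch)
batch_size, input_length, num_classes = 4, 50, 28
target_length = 10

# Random stand-in for a model output of shape [batch_size, input_length, num_classes]
logits = torch.randn(batch_size, input_length, num_classes)

# nn.CTCLoss expects log-probabilities of shape [input_length, batch_size, num_classes]
log_probs = logits.log_softmax(dim=-1).permute(1, 0, 2)

# Class index 0 is the CTC blank by default, so targets use indices 1..num_classes-1
targets = torch.randint(1, num_classes, (batch_size, target_length))
input_lengths = torch.full((batch_size,), input_length, dtype=torch.long)
target_lengths = torch.full((batch_size,), target_length, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```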

I currently have a setup like the following:

        # 1 channel input, 3 channel output
        self.conv1 = nn.Conv2d(1, 3, 2)

        backbone = models.resnet50(pretrained=True)
        num_filters = backbone.fc.in_features
        layers = list(backbone.children())[:-1]
        self.feature_extractor = nn.Sequential(*layers)

        num_target_classes = 28
        self.classifier = nn.Linear(num_filters, num_target_classes)


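In the standard ResNet, the forward method flattens the output of the adaptive pooling layer into a 2-dimensional tensor of shape [batch_size, num_features] and passes it to the linear layer. Because nn.Linear operates on the last dimension of its input, it should be possible to create a 3-dimensional tensor of shape [batch_size, input_length, num_features] instead and pass it to self.fc.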