Using PyTorch VGGish for audio classification tasks

I am researching the use of a pretrained VGGish model for audio classification tasks; ideally I could have a model that classifies any of the classes defined in Google's AudioSet. I came across a nice PyTorch port for generating audio features. The original model generates only audio features as well. The original team generally suggests proceeding as follows:

As a feature extractor : VGGish converts audio input features into a semantically meaningful, high-level 128-D embedding which can be fed as input to a downstream classification model. The downstream model can be shallower than usual because the VGGish embedding is more semantically compact than raw audio features. So, for example, you could train a classifier for 10 of the AudioSet classes by using the released embeddings as features. Then, you could use that trained classifier with any arbitrary audio input by running the audio through the audio feature extractor and VGGish model provided here, passing the resulting embedding features as input to your trained model. vggish_inference_demo.py shows how to produce VGGish embeddings from arbitrary audio

I’m not sure how to go about getting the released embeddings and using them for training in PyTorch. I’m also not sure how to turn the embeddings into classifications. Could anyone kindly share some pointers? Thanks!

Based on the description you’ve posted, it seems the authors call the output features embeddings.
This might be a bit confusing, as there are nn.Embedding layers, which are apparently not what is meant here.

If I understand the use case correctly, you could store each output feature of the VGGish model with its corresponding target, create a new classification model, and use these output features + targets to train this new classifier.

According to their claim, this classifier can be “shallower”, as the “embeddings” are so great. :wink:

Let me know if that makes sense.
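To make that concrete, here is a minimal sketch of the idea. Note the VGGish forward pass is mocked with random 128-D vectors here (the actual model/port and the file name are assumptions), since the point is just how to store features + targets and feed them to a new classifier:

```python
import os
import tempfile

import torch

# Stand-ins for VGGish outputs: in practice you'd run each wav clip
# through the pretrained VGGish model to get a 128-D embedding.
num_clips, embedding_dim, num_classes = 100, 128, 10
features = torch.randn(num_clips, embedding_dim)          # "embeddings"
targets = torch.randint(0, num_classes, (num_clips,))     # class labels

# Save features + targets so the (expensive) VGGish pass runs only once
path = os.path.join(tempfile.gettempdir(), "vggish_features.pt")
torch.save({"features": features, "targets": targets}, path)

# Later: load them back and train a new, shallow classifier on top
data = torch.load(path)
classifier = torch.nn.Linear(embedding_dim, num_classes)
logits = classifier(data["features"])
print(logits.shape)  # torch.Size([100, 10])
```

Since the stored features are small (128 floats per clip), the whole training set usually fits in memory even when the raw audio would not.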


you could store each output feature of the VGGish model with its corresponding target, create a new classification model, and use these output features + targets to train this new classifier.

Cool, thanks for the tip! Do these seem like reasonable steps for training a new model?

  1. Download audio wav samples from AudioSet as training data
  2. Send each wav sample through VGGish to get the corresponding 128-dimensional vector (output feature)
  3. Define a Dataset comprising the VGGish output feature as input (x) and the corresponding target (y)
  4. Using nn.Module, define a “shallow” model with a single layer, say nn.Linear()
  5. Train

Yes, your approach seems reasonable!
Let us know how your experiments go. :slight_smile:

Hi, I want to get 128-dimensional features for my own video data. What should I do?

For audio classification of video data, you could first extract the audio track into a wav file and slice it into short clips using a tool like FFmpeg, with each filename corresponding to its time segment in the video. The clips can then be fed through the network to extract features.
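For the slicing step, a pure-Python sketch using the standard-library `wave` module might look like this (the clip length and the start-time naming scheme are just example choices; FFmpeg can do the same job faster for large batches):

```python
import wave

def slice_wav(path, clip_seconds=1.0, prefix="clip"):
    """Split a wav file into fixed-length clips named by start time."""
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_clip = int(params.framerate * clip_seconds)
        start, names = 0, []
        while True:
            frames = src.readframes(frames_per_clip)
            if not frames:
                break
            # Filename encodes the time segment of the original recording
            name = f"{prefix}_{start / params.framerate:.1f}s.wav"
            with wave.open(name, "wb") as dst:
                dst.setnchannels(params.nchannels)
                dst.setsampwidth(params.sampwidth)
                dst.setframerate(params.framerate)
                dst.writeframes(frames)
            names.append(name)
            start += frames_per_clip
    return names
```

Note the last clip may be shorter than `clip_seconds`; depending on the feature extractor you may want to pad or drop it.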