I am researching the use of the pretrained VGGish model for audio classification tasks; ideally I would have a model that can classify any of the classes defined in the Google AudioSet. I came across a nice PyTorch port for generating audio features. The original model generates only audio features as well. The original team suggests the following general way to proceed:
As a feature extractor: VGGish converts audio input features into a semantically meaningful, high-level 128-D embedding which can be fed as input to a downstream classification model. The downstream model can be shallower than usual because the VGGish embedding is more semantically compact than raw audio features. So, for example, you could train a classifier for 10 of the AudioSet classes by using the released embeddings as features. Then, you could use that trained classifier with any arbitrary audio input by running the audio through the audio feature extractor and VGGish model provided here, passing the resulting embedding features as input to your trained model.
vggish_inference_demo.py shows how to produce VGGish embeddings from arbitrary audio.
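To make the question concrete, here is how I'm currently producing embeddings with the PyTorch port I mentioned (harritaylor/torchvggish; the torch.hub call follows my reading of its README, so treat the exact entry point as my assumption):

```python
import torch

# Load the PyTorch VGGish port via torch.hub (this mirrors the torchvggish
# README as I understand it; please correct me if the entry point differs).
model = torch.hub.load('harritaylor/torchvggish', 'vggish')
model.eval()

# The port takes a path to a wav file, handles the log-mel preprocessing,
# and returns one 128-D embedding per ~0.96 s patch of audio.
# 'example.wav' is just a placeholder file name.
with torch.no_grad():
    embeddings = model.forward('example.wav')  # shape: (num_patches, 128)
```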
I'm not sure how to go about getting the released embeddings and using them for training in PyTorch. I'm also not sure how to translate the embeddings into classification; my rough idea for the downstream classifier is sketched below. Could anyone kindly share some pointers? Thanks!
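Here is the shallow downstream model I had in mind. This is purely my own sketch: the file names and the 10-class setup are placeholders, and I'm assuming I've already converted the released embeddings (or ones I extract myself as above) into plain tensors, one mean-pooled 128-D vector plus an integer label per clip.

```python
import torch
import torch.nn as nn

# Hypothetical tensors prepared offline: per-patch VGGish embeddings
# averaged into one 128-D vector per clip, plus an integer label per clip.
X = torch.load('train_embeddings.pt')  # shape (num_clips, 128), my assumption
y = torch.load('train_labels.pt')      # shape (num_clips,)

num_classes = 10  # e.g. the 10 AudioSet classes mentioned in the README

# "Shallower than usual": a small MLP on top of the 128-D embedding.
classifier = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, num_classes),
)

optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=64, shuffle=True)

for epoch in range(20):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(classifier(xb), yb)
        loss.backward()
        optimizer.step()

# At inference: run new audio through VGGish, mean-pool the per-patch
# embeddings into a single 128-D vector, then take the argmax of the logits.
```

In particular, I'm not sure whether mean-pooling the per-patch embeddings into one clip-level vector like this is the right approach, or whether the classifier should see each 0.96 s patch separately.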