I am rather new to Wav2Vec 2.0, having worked with it since December on a speech emotion recognition project. I started with the torchaudio.pipelines bundle of the Wav2Vec 2.0 base model, put a classification head on top of it, froze the convolutional feature-extractor layers, and fine-tune the rest on my emotion recognition task. I can work with the output of the last layer (12), but I would rather access the output of layer 10.
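For reference, the freezing setup I mean looks roughly like this. This is only a minimal sketch on a tiny stand-in module (the real bundle downloads pretrained weights); TinyWav2Vec, its layer sizes, and its attribute names are made up for illustration, but the `requires_grad = False` pattern is the same one I apply to the real model's feature extractor:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the wav2vec 2.0 layout: a convolutional
# feature extractor followed by transformer-style layers.
class TinyWav2Vec(nn.Module):
    def __init__(self, dim=768, n_layers=12):
        super().__init__()
        self.feature_extractor = nn.Conv1d(1, dim, kernel_size=10, stride=5)
        self.transformer_layers = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(n_layers)
        )

model = TinyWav2Vec()

# Freeze the convolutional feature extractor, fine-tune the rest.
# On the real torchaudio model I do the same loop over
# model.feature_extractor.parameters().
for p in model.feature_extractor.parameters():
    p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
frozen = [n for n, p in model.named_parameters() if not p.requires_grad]
```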
I did this with model.extract_features(). How can I use the model without storing every layer's output? I have batches of 32 waveforms of 400,000 samples each, which (I suspect) is the reason I can't make it run on a GPU: the feature layers lead to a 12 x 32 x 1400 x 768 tensor being stored. There is the num_layers argument, but it returns the outputs of all layers up to a given number (why only an int instead of a list?). I modified components.py to return only the layer I want (nice: I can now run batches of 32 instead of the previous maximum of 25 on my local CPU). I also tried the Hugging Face Transformers model and used model(x).last_hidden_state inside my classification-head model.

With everything I tried, I still get a memory overload of more than 80 GB on the GPU. I guess I am not using the right method for what I want to achieve. Please help!
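One alternative I have been considering, instead of editing components.py, is a forward hook that keeps only the one layer's output. Here is a sketch on a hypothetical 12-layer stack of Linear layers standing in for the transformer; on the real torchaudio model the layer list is, as far as I can tell, model.encoder.transformer.layers (please verify on your version):

```python
import torch
import torch.nn as nn

# Hypothetical 12-layer stack standing in for the wav2vec 2.0 transformer.
layers = nn.ModuleList(nn.Linear(768, 768) for _ in range(12))

captured = {}

def save_output(module, inputs, output):
    # Store only this one layer's output instead of all twelve.
    captured["layer10"] = output

# Layer "10" is index 9 in a zero-based list.
handle = layers[9].register_forward_hook(save_output)

x = torch.randn(2, 1400, 768)  # (batch, frames, hidden)
with torch.no_grad():
    h = x
    for layer in layers:
        h = layer(h)

handle.remove()
features = captured["layer10"]  # (batch, frames, hidden) from layer 10
```

Running the frozen part under torch.no_grad() also matters, I think: without it autograd keeps every intermediate activation for backprop, which could explain a much larger memory footprint than the returned feature list itself.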