Memory/Performance of Training Wav2vec2 Model

Hey All,

I am rather new to Wav2Vec 2.0, having worked with it since December on a speech emotion recognition project. I started from the torchaudio.pipelines bundle of the Wav2Vec 2.0 base model, put a classification head on top of it, froze the CNN feature-extractor layers, and fine-tuned the rest on my emotion recognition task. I could work with the last layer's (12) output, but I would rather access the output of layer 10.
I did this with model.extract_features(). How can I use the model without storing every layer's output? I have a batch of 32 waveforms of length 400,000, which is (I suspect) the reason I can't make it run on a GPU: the feature layers lead to a 12 x 32 x 1400 x 768 tensor being stored. There is the argument num_layers, which returns the outputs of all layers up to a given number (why only an int instead of a list?). I modified it to return only the layer I want (nice! I can now run batches of 32 instead of the previous maximum of 25 on my local CPU). I also tried the transformers implementation and used model(x).last_hidden_state inside my classification head model. With everything I tried, I still get a memory overload of more than 80 GB on the GPU. I guess I am not using the right method for what I want to achieve... please help me! :slight_smile:

I don’t know how the extract_features method is implemented, but your approach of returning the desired activation/feature sounds valid.
Alternatively, you could also try to use forward hooks as described here, assuming you are interested in the output of a specific nn.Module.
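The forward-hook pattern looks roughly like this. A toy nn.Sequential stands in for the transformer stack here so the snippet is self-contained; in the real model you would register the hook on the specific transformer layer you care about (the internal attribute names depend on the implementation you use).

```python
import torch
import torch.nn as nn

# Toy stack standing in for the transformer layers; the same pattern applies
# to any submodule of a real model (e.g. one transformer layer of Wav2Vec 2.0).
model = nn.Sequential(
    nn.Linear(8, 8),
    nn.Linear(8, 8),
    nn.Linear(8, 8),
)

captured = {}

def hook(module, inputs, output):
    # Store the activation of the hooked submodule.
    captured["act"] = output

# Capture the output of the second layer only.
handle = model[1].register_forward_hook(hook)

x = torch.randn(4, 8)
_ = model(x)
handle.remove()  # detach the hook once it is no longer needed

print(captured["act"].shape)  # (4, 8)
```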
If you don’t want to train the model, you could also wrap the forward pass into a with torch.no_grad() statement to avoid storing the intermediate activations which would otherwise be needed for the gradient computation.
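A minimal sketch of that, with a stand-in linear model:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)
x = torch.randn(4, 8)

# Inside no_grad, autograd does not record the forward pass, so the
# intermediate activations needed for backward are never kept around.
with torch.no_grad():
    out = model(x)

print(out.requires_grad)  # False
```

Remember this only applies to the frozen parts; anything you fine-tune (e.g. the classification head) still needs gradients, so you would run the frozen backbone under no_grad and the trainable head outside of it.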


Thank you very much for the quick reply!
I think I can resolve this.
Yes, storing all transformer layers' outputs is wasteful if you only need certain layers.
I implemented a method that slices the original Wav2Vec 2.0 base model up to the transformer layer whose output I want.
By doing this, freezing the first 20 layers for training (I'd like to freeze only the CNN layers), and reducing the batch size from 32 to 20, I am able to run it on a GPU with ~80 GB of memory. Wish I had more memory :confused: