CUDA out of memory after a few epochs

After 4 epochs I am getting a CUDA out of memory error.
I am using the Wav2Vec2 HuggingFace model with a PyTorch training setup.

CUDA memory summary initially:

|                  PyTorch CUDA memory summary, device ID 0                 |                                          
|---------------------------------------------------------------------------|                                          
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |                                          
|===========================================================================|                                          
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |                                          
|---------------------------------------------------------------------------|                                          
| Allocated memory      |  369906 KB |  369906 KB |  369906 KB |       0 B  |                                          
|       from large pool |  368384 KB |  368384 KB |  368384 KB |       0 B  |                                          
|       from small pool |    1522 KB |    1522 KB |    1522 KB |       0 B  |                                          
|---------------------------------------------------------------------------|                                          
| Active memory         |  369906 KB |  369906 KB |  369906 KB |       0 B  |                                          
|       from large pool |  368384 KB |  368384 KB |  368384 KB |       0 B  |                                          
|       from small pool |    1522 KB |    1522 KB |    1522 KB |       0 B  |                                          
|---------------------------------------------------------------------------|                                          
| GPU reserved memory   |  409600 KB |  409600 KB |  409600 KB |       0 B  |                                          
|       from large pool |  407552 KB |  407552 KB |  407552 KB |       0 B  |                                          
|       from small pool |    2048 KB |    2048 KB |    2048 KB |       0 B  |                                          
|---------------------------------------------------------------------------|                                          
| Non-releasable memory |   39694 KB |   50508 KB |  263679 KB |  223985 KB |                                          
|       from large pool |   39168 KB |   48896 KB |  261632 KB |  222464 KB |                                          
|       from small pool |     526 KB |    2047 KB |    2047 KB |    1521 KB |                                          
|---------------------------------------------------------------------------|                                          
| Allocations           |     251    |     251    |     251    |       0    |                                          
|       from large pool |      80    |      80    |      80    |       0    |                                          
|       from small pool |     171    |     171    |     171    |       0    |                                          
|---------------------------------------------------------------------------|                                          
| Active allocs         |     251    |     251    |     251    |       0    |                                          
|       from large pool |      80    |      80    |      80    |       0    |                                          
|       from small pool |     171    |     171    |     171    |       0    |                                          
|---------------------------------------------------------------------------|            
| GPU reserved segments |      21    |      21    |      21    |       0    |                                          
|       from large pool |      20    |      20    |      20    |       0    |                                          
|       from small pool |       1    |       1    |       1    |       0    |                                          
|---------------------------------------------------------------------------|                                          
| Non-releasable allocs |      19    |      19    |      20    |       1    |                                          
|       from large pool |      18    |      18    |      19    |       1    |                                          
|       from small pool |       1    |       1    |       1    |       0    |                                          
|---------------------------------------------------------------------------|                                          
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------| 
| Oversize GPU segments |       0    |       0    |       0    |       0    |                                          
|===========================================================================|    

CUDA memory summary after epoch 1:

|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    2680 MB |    3158 MB |    3642 GB |    3639 GB |
|       from large pool |     377 MB |     812 MB |    3440 GB |    3440 GB |
|       from small pool |    2302 MB |    2346 MB |     201 GB |     199 GB |
|---------------------------------------------------------------------------|
| Active memory         |    2680 MB |    3158 MB |    3642 GB |    3639 GB |
|       from large pool |     377 MB |     812 MB |    3440 GB |    3440 GB |
|       from small pool |    2302 MB |    2346 MB |     201 GB |     199 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    2772 MB |    3282 MB |    3282 MB |  522240 KB |
|       from large pool |     418 MB |     882 MB |     882 MB |  475136 KB |
|       from small pool |    2354 MB |    2400 MB |    2400 MB |   47104 KB |
|---------------------------------------------------------------------------|
| Non-releasable memory |   93778 KB |  126483 KB |    3056 GB |    3056 GB |
|       from large pool |   41216 KB |   70496 KB |    2826 GB |    2826 GB |
|       from small pool |   52562 KB |   56630 KB |     229 GB |     229 GB |
|---------------------------------------------------------------------------|
| Allocations           |   19128    |   19392    |    2592 K  |    2573 K  |
|       from large pool |      81    |     198    |    1308 K  |    1308 K  |
|       from small pool |   19047    |   19209    |    1284 K  |    1265 K  |
|---------------------------------------------------------------------------|
| Active allocs         |   19128    |   19392    |    2592 K  |    2573 K  |
|       from large pool |      81    |     198    |    1308 K  |    1308 K  |
|       from small pool |   19047    |   19209    |    1284 K  |    1265 K  |
|---------------------------------------------------------------------------|
| GPU reserved segments |    1198    |    1238    |    1238    |      40    |
|       from large pool |      21    |      38    |      38    |      17    |
|       from small pool |    1177    |    1200    |    1200    |      23    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |    1252    |    1258    |    1960 K  |    1959 K  |
|       from large pool |      19    |      26    |    1064 K  |    1064 K  |
|       from small pool |    1233    |    1239    |     896 K  |     895 K  |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|

The currently posted memory summary outputs show CUDA OOMs: 0, so could you describe the issue in more detail, please?

I am getting the following error:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 13.32 GiB already allocated; 3.75 MiB free; 13.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

If I don’t use the Wav2Vec2 model from HuggingFace and use PyTorch models instead, training works fine. I also tried reducing the batch size to 8, but after epoch 5 it is still throwing a CUDA out of memory error.

The error message explains that your GPU has only 3.75 MiB of free memory while you are trying to allocate 2 MiB. The free memory is not necessarily available as a single contiguous block, so the OOM error might be expected.
I’m not familiar with the mentioned model, but you might need to decrease the batch size further.
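
If reserved memory stays much larger than allocated memory, the allocator setting mentioned in the error message could also be worth a try. A minimal sketch, assuming the variable is set before the first CUDA allocation in the process (128 is just an example value, not a recommendation):

import os

# Must be set before the first CUDA allocation in the process
# (128 MB is an example value, not a recommendation).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch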

I tried making the batch size even smaller (4), but it is still showing the same error.

Based on this post it seems a GPU with 32GB should “be enough to fine-tune the model”, so you might need to further decrease the batch size and/or the sequence lengths, since you are still running OOM on your 15GB device.
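
For example, a hypothetical collate_fn that caps the waveform length before batching (the 3-second cap and the 16 kHz sample rate are placeholder assumptions):

import torch

MAX_SAMPLES = 16000 * 3  # hypothetical cap: 3 seconds at 16 kHz

def collate_fn(batch):
    # batch is a list of (waveform, label) pairs; waveform shape (1, num_samples)
    waves = [w[..., :MAX_SAMPLES].squeeze(0) for w, _ in batch]
    labels = torch.tensor([y for _, y in batch])
    # zero-pad to the longest (already capped) waveform in the batch
    padded = torch.nn.utils.rnn.pad_sequence(waves, batch_first=True)
    return padded, labels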

@ptrblck
It is not working even if I make the batch size 1, and the input sequence length of each batch is only 3 seconds.

@ptrblck
If we use the PyTorch Wav2Vec2 implementation to fine-tune it for downstream audio classification, is my code snippet correct?

import torch
import torch.nn as nn
import torchaudio

class AudioClassifier(nn.Module):
    def __init__(self, dim_out_w2v2, num_classes=5):  # num_classes=5 is the number of classes
        super().__init__()
        self.bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
        self.model_w2v2 = self.bundle.get_model()
        self.fc = nn.Linear(dim_out_w2v2, num_classes)

    def forward(self, inp):
        inp = torch.squeeze(inp, 1)           # inp shape is (batch, 1, audio_length)
        features, _ = self.model_w2v2(inp)
        sp = self.stat_pool_layer(features)   # statistical pooling layer (defined elsewhere, not shown)
        out = self.fc(sp)
        return out

     

This is then followed by a standard PyTorch training loop, with some of the layers in the Wav2Vec2 model frozen.
The loss function is cross entropy over the 5 classes; a minimal sketch of this setup follows.
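
A minimal sketch of what I mean (the choice to freeze the convolutional feature extractor, the submodule name feature_extractor from torchaudio's Wav2Vec2Model, and the optimizer settings are assumptions):

import torch

model = AudioClassifier(dim_out_w2v2=768, num_classes=5)  # 768 assumed for the base model

# Freeze the convolutional feature extractor of the Wav2Vec2 backbone;
# which layers to freeze is a design choice.
for p in model.model_w2v2.feature_extractor.parameters():
    p.requires_grad = False

criterion = torch.nn.CrossEntropyLoss()  # cross entropy over the 5 classes
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)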

Or do I need to take care of anything else when fine-tuning Wav2Vec2?

@ptrblck :point_up_2:

Sorry, as already mentioned I’m not familiar with the model and don’t see any error besides the expected OOM. If you can verify that the model should indeed fit on your GPU, let me know. Otherwise, I would expect to see this error and would suggest either lowering the batch size / sample length, or using checkpointing etc. to trade compute for memory.
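
A minimal sketch of activation checkpointing via torch.utils.checkpoint (the toy sub-network below is a placeholder; you would wrap your real blocks):

import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.ReLU(),
    torch.nn.Linear(768, 768),
)

x = torch.randn(8, 768, requires_grad=True)
# Activations inside block are recomputed during backward instead of
# being stored, trading compute for memory.
out = checkpoint(block, x)
out.sum().backward()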

Sorry for the confusion, my question is different now.
I have shifted to the PyTorch implementation of Wav2Vec2 and I am trying to fine-tune it with one linear layer on top of it.
Is my code correct?
While training I need to freeze some layers of the Wav2Vec2 model.

I don’t think your code is correct, since it assumes the output of the model is features, while I would assume these are logits, as described in this tutorial:

Once the acoustic features are extracted, the next step is to classify them into a set of categories.
Wav2Vec2 model provides method to perform the feature extraction and classification in one step.

    with torch.inference_mode():
        emission, _ = model(waveform)

The output is in the form of logits. It is not in the form of probability.

In this case you could check if calling extract_features would be a better solution.
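
For example, a hedged sketch built on extract_features (which layer to take and the mean pooling over time are my assumptions, not part of the tutorial):

import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()

waveform = torch.randn(2, 16000)  # dummy batch of 1-second clips at 16 kHz

# extract_features returns the outputs of the transformer layers,
# not the ASR logits produced by forward().
features_list, _ = model.extract_features(waveform)
features = features_list[-1]                # (batch, time, 768) for the base model

pooled = features.mean(dim=1)               # mean pooling over time
fc = torch.nn.Linear(features.size(-1), 5)  # 5 classes
logits = fc(pooled)                         # (batch, 5)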

It is a bit confusing.

Could you please provide a code snippet for fine-tuning the PyTorch version of Wav2Vec2 for audio classification?

@ptrblck
Could you please help out?