Out of Memory after 2 hours

I’m developing a language classifier. For that, I’ve downloaded Common Voice in 34 languages and a pretrained Wav2Vec2 model that I want to fine-tune for this task. Training is parallelized with PyTorch Lightning across 8 RTX 3090s. The system has 96 GB of CPU RAM.

The problem is that my CPU memory consumption jumps to 70 GB initially and then creeps up by several MB per step, until the process crashes after an hour or two. So far I’ve been unable to track down the leak. I’ve tried several memory profilers and debugging techniques, with no success. Here is what I’ve tried:

  • PDB with tracemalloc or objgraph: I like these tools, but sadly the growing memory consumption didn’t show up there, which makes me think that an external (native) library is leaking.
  • Valgrind: I stopped this attempt after 90 minutes in the initialization phase; Valgrind slows Python down too much.
  • I’ve compared tracemalloc’s measured memory with psutil.Process(os.getpid()).memory_info(). While tracemalloc did not show the growth, psutil did.
  • I’ve hypothesized that the high memory consumption comes from torchaudio.datasets.COMMONVOICE loading the entire subdirectory as millions of short strings, which produces a huge per-object overhead. To reduce it, I’ve forked torchaudio.datasets.COMMONVOICE and changed it to use numpy arrays instead. This did cut my memory consumption by 10 GB, but the memory still leaked, so it didn’t solve the issue.
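For reference, here is a rough sketch of the kind of overhead the last bullet is about, using hypothetical Common Voice-style file names (the exact numbers depend on string length and Python version, but the direction holds for ASCII names):

```python
import sys
import numpy as np

# Hypothetical illustration: 100k short file names, as in a Common Voice tsv.
paths = [f"clip_{i:07d}.mp3" for i in range(100_000)]

# Each small Python str carries ~49 bytes of object header on top of its
# text, plus an 8-byte pointer slot in the list.
list_bytes = sys.getsizeof(paths) + sum(sys.getsizeof(p) for p in paths)

# One fixed-width byte-string array stores only the characters themselves.
arr = np.array(paths, dtype="S")  # dtype becomes S16 for these 16-char names
arr_bytes = arr.nbytes

print(f"list of str: {list_bytes / 2**20:.1f} MiB, "
      f"numpy S-array: {arr_bytes / 2**20:.1f} MiB")
```

Beyond the raw byte count, replacing millions of small heap objects with one array also removes allocator fragmentation, which can matter as much as the size itself.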

My current guess is that torchaudio.load is the issue. I’ve trained a very similar approach that uses WAV files instead of Common Voice’s MP3 files, and there was no leak.
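One way to test that guess in isolation is a small harness that calls a suspect function in a tight loop while logging both tracemalloc (Python allocator) and psutil RSS, which also demonstrates the divergence described above: if RSS climbs while the traced figure stays flat, the leak is in native code the Python tracer cannot see. This is a sketch under the assumption that psutil is installed; the clip path in the usage comment is made up.

```python
import gc
import os
import tracemalloc
import psutil

def log_mem(tag: str) -> None:
    """Print Python-allocator memory (tracemalloc) next to process RSS (psutil)."""
    traced, _peak = tracemalloc.get_traced_memory()
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"{tag}: traced={traced / 2**20:.1f} MiB  rss={rss / 2**20:.1f} MiB")

def check_leak(fn, n_iters: int = 500, report_every: int = 100) -> None:
    """Call fn repeatedly; steady RSS growth across reports points at fn leaking."""
    tracemalloc.start()
    for i in range(1, n_iters + 1):
        fn()
        if i % report_every == 0:
            gc.collect()  # rule out garbage that merely hasn't been collected yet
            log_mem(f"iter {i}")
    tracemalloc.stop()

# Hypothetical usage: decode the same clip over and over, no training involved.
# check_leak(lambda: torchaudio.load("clips/sample.mp3"))
```

If RSS grows here with nothing but the decode call in the loop, that would pin the leak on the MP3 decoding path rather than on the model or the training loop.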

Do you have any ideas on how I could track down the issue?

Update: In case someone asks: the num_workers parameter of the DataLoader is 0.

You could replace the torchaudio dataset with a fake or custom one and check if the memory increase is still visible. If not, you could then iterate over the original dataset only (without training) and check if that alone triggers the increase, or if it’s only visible in the training loop.
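A minimal sketch of such a fake dataset, assuming the COMMONVOICE-style (waveform, sample_rate, metadata) return shape; it's written duck-typed with numpy so it needs no torchaudio import, since any object with `__len__`/`__getitem__` works as a map-style dataset for a PyTorch DataLoader (in the real pipeline you would return torch tensors):

```python
import numpy as np

class FakeCommonVoice:
    """Stand-in for torchaudio.datasets.COMMONVOICE that returns random
    'waveforms', so torchaudio.load and the MP3 files are never touched.
    If memory is stable with this dataset, the leak is in the real loader."""

    def __init__(self, length: int = 10_000, sample_rate: int = 48_000,
                 seconds: int = 5):
        self.length = length
        self.sample_rate = sample_rate
        self.n_samples = sample_rate * seconds

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, idx):
        # Mirror the (waveform, sample_rate, metadata dict) tuple shape.
        waveform = np.random.randn(1, self.n_samples).astype(np.float32)
        return waveform, self.sample_rate, {"client_id": str(idx), "sentence": ""}
```

Swapping this in while keeping the rest of the training loop unchanged splits the search space cleanly: leak still present means the loop or model, leak gone means the dataset/decoding path.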

Thanks for the hint. I will try that tomorrow and report the results here :+1: