I’ve integrated Opacus into my Lightning training script for an NLP application. I’m running into GPU out-of-memory errors that I haven’t been able to resolve, so I’d like to use BatchMemoryManager to keep a large logical batch size while capping the physical batch size that actually hits the GPU. My question is:
How do I use BatchMemoryManager with Lightning? Is that currently supported?
My current implementation follows the same skeleton as the Opacus Lightning tutorial (examples/mnist_lightning.py in the pytorch/opacus repo on GitHub).
Thanks in advance for any help