Distributed Data and Model

I am trying to train a model with only 5 million parameters, but even with a batch size of 1 I hit the “CUDA out of memory” error. I suspect the LSTM layers are producing a large number of intermediate tensors.

Given that, I understand that data parallelism may not help here, since the problem persists even with a batch size of one. What steps should I take to address this issue?

I worked around the issue by manually splitting the model across different devices, but I am sure it’s not efficient.
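
For readers following along, here is a minimal sketch of what a manual split like this might look like. The thread does not show the actual model, so the class name `TwoDeviceLSTM`, the layer sizes, and the two-stage layout are all hypothetical. The sketch falls back to CPU when fewer than two GPUs are visible, so it runs anywhere.

```python
import torch
import torch.nn as nn


class TwoDeviceLSTM(nn.Module):
    """Hypothetical two-stage model: first LSTM on dev0, second LSTM and head on dev1."""

    def __init__(self, input_size, hidden_size, num_classes, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        # Each stage keeps its parameters (and its activations) on its own device.
        self.lstm1 = nn.LSTM(input_size, hidden_size, batch_first=True).to(dev0)
        self.lstm2 = nn.LSTM(hidden_size, hidden_size, batch_first=True).to(dev1)
        self.head = nn.Linear(hidden_size, num_classes).to(dev1)

    def forward(self, x):
        out, _ = self.lstm1(x.to(self.dev0))
        # Move the intermediate activations to the second device by hand.
        out, _ = self.lstm2(out.to(self.dev1))
        # Classify from the last time step.
        return self.head(out[:, -1])


# Assumption: use two GPUs if present, otherwise run both stages on CPU.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev0 = dev1 = torch.device("cpu")

model = TwoDeviceLSTM(input_size=16, hidden_size=32, num_classes=4, dev0=dev0, dev1=dev1)
x = torch.randn(1, 100, 16)  # batch size 1, sequence length 100 (made-up sizes)
logits = model(x)
logits.sum().backward()  # autograd routes gradients back across both devices
print(logits.shape)
```

In a naive split like this only one device is busy at any moment, because the second stage waits for the first to finish. That idle time is the usual reason such a split feels inefficient.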


5 million parameters seems like a pretty small model to OOM out on.
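
As a rough back-of-the-envelope check, here is the arithmetic behind that remark. It assumes fp32 values and an Adam-style optimizer that keeps two extra states per parameter; neither is stated in the thread, and the helper name `training_state_bytes` is made up for illustration.

```python
def training_state_bytes(num_params, bytes_per_value=4, optimizer_states=2):
    """Bytes for weights + gradients + optimizer states (assumed fp32, Adam-like)."""
    copies = 1 + 1 + optimizer_states  # weights, gradients, optimizer states
    return num_params * bytes_per_value * copies


num_params = 5_000_000
weights_mib = num_params * 4 / 1024**2
total_mib = training_state_bytes(num_params) / 1024**2

print(f"weights only: {weights_mib:.1f} MiB")                 # 19.1 MiB
print(f"weights + grads + optimizer: {total_mib:.1f} MiB")    # 76.3 MiB
```

Under these assumptions the parameter-related state is well under 100 MiB, so an OOM at batch size 1 would have to come from something else, such as activations saved for the backward pass over long sequences.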

Do you have more details on your system and memory available?

For OOMs, you are correct that plain data parallelism may not be the solution, but there are workarounds. FSDP (Introducing PyTorch Fully Sharded Data Parallel (FSDP) API | PyTorch) can shard the model across multiple ranks, which saves memory, though it still falls under the data parallelism umbrella. Manually splitting the model, as you did, is a form of model parallelism. We also have support for pipeline parallelism, which performs automatic model splitting for you (GitHub - pytorch/PiPPy: Pipeline Parallelism for PyTorch).

I will check your links.
Here you can find more details about my system and memory