Unexpected out of memory error when training with L-BFGS

For some reason, the code below seems to have a memory leak. I can’t figure out the exact cause, but it only occurs when using the L-BFGS optimizer.

import torch
from typing import Tuple


class SampleDataset(torch.utils.data.Dataset):
    """Simple dataset for collected samples."""

    def __init__(self, data: torch.Tensor, labels: torch.Tensor) -> None:
        self.data = [data[i].clone() for i in range(data.shape[0])]
        self.labels = [labels[i].clone() for i in range(labels.shape[0])]

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, int]:
        return self.data[idx], self.labels[idx]

    def __len__(self) -> int:
        return len(self.data)


def train_model(model: torch.nn.Module, dataloader, device: torch.device, num_epochs: int = 5):
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.LBFGS(
        model.parameters(), lr=1.0, max_iter=1, tolerance_change=-1, tolerance_grad=-1
    )
    for n in range(num_epochs):
        print("Epoch", n + 1)
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)

            # L-BFGS requires a closure that re-evaluates the model and returns the loss.
            def closure() -> torch.Tensor:
                optimizer.zero_grad()
                output = model(inputs)
                loss = criterion(output, labels)
                loss.backward()
                return loss

            optimizer.step(closure)


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# 363 samples of shape (2560, 9, 9); 121 classes, 3 samples per class
activation_samples = torch.randn(363, 2560, 9, 9)
activation_labels = list(range(0, 121)) * 3
activation_labels = torch.as_tensor(activation_labels)
sample_data = activation_samples.reshape(activation_samples.shape[0], -1).double()

# Setup dataset
sample_dataset = SampleDataset(sample_data.cpu(), activation_labels.cpu())
dataloader = torch.utils.data.DataLoader(
    sample_dataset, batch_size=8, num_workers=0, shuffle=True
)

model = torch.nn.Linear(sample_data.shape[1], 121, bias=False).to(device).double()
model = model.train()

train_model(model, dataloader, device)
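
One way to check whether the usage actually grows across epochs (a real leak) or is just a single large allocation is to print the allocator statistics after each epoch. Below is a minimal sketch using torch.cuda.memory_allocated and torch.cuda.max_memory_allocated; the helper and the place it is called from are my own addition, not part of the original script:

import torch

def log_cuda_memory(tag: str) -> None:
    """Print current and peak allocated CUDA memory in GiB."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024**3
        peak = torch.cuda.max_memory_allocated() / 1024**3
        print(f"{tag}: allocated={allocated:.2f} GiB, peak={peak:.2f} GiB")

# For example, call log_cuda_memory(f"epoch {n + 1}") at the end of each epoch in train_model.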

I don’t think this is a memory leak, but rather the expected high memory requirement of LBFGS.
From the docs:

This is a very memory intensive optimizer (it requires additional param_bytes * (history_size + 1) bytes). If it doesn’t fit in memory try reducing the history size, or use a different algorithm.

By default a history_size of 100 is used, so:

>>> model.weight.nelement() * 4 / 1024**3 * 101
9.440431594848633

~9.44 GB would be needed in addition to the “standard” memory usage.
Limiting the history_size to 10 uses ~6.8 GB in my setup.
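
As a concrete sketch, the optimizer setup from the question only needs the history_size argument added; the layer below is just a stand-in with the same shape so the snippet runs on its own:

import torch

# Stand-in layer with the same shape as in the question, so the snippet is self-contained.
model = torch.nn.Linear(2560 * 9 * 9, 121, bias=False).double()

optimizer = torch.optim.LBFGS(
    model.parameters(),
    lr=1.0,
    max_iter=1,
    tolerance_change=-1,
    tolerance_grad=-1,
    history_size=10,  # default is 100; extra state scales with param_bytes * (history_size + 1)
)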

@ptrblck The code is running out of memory with 16 GB of GPU memory. I also think I had it working at first, before I changed something slightly and ended up with the out-of-memory error, so I know it’s possible on my system.

The error message when it crashes for me is:

Epoch 1
CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 14.76 GiB total capacity; 13.32 GiB already allocated; 41.75 MiB free; 13.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I’ve tested with PyTorch versions torch 1.10.0+cu111 and torch 1.11.0.
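
For reference, the max_split_size_mb suggestion from the error message only addresses fragmentation (reserved memory much larger than allocated), which doesn’t look like the problem here since the allocated memory is already close to the card’s capacity. If someone wants to try it anyway, it is set through an environment variable before the first CUDA allocation, e.g.:

import os

# Must be set before the first CUDA allocation (easiest: before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402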

The original code is also running OOM on a 40 GB A100, so unless you changed the history_size or the model itself, I don’t see how you could fit it into 16 GB.
Is the script still running OOM after setting history_size=10?

Thanks for testing with a 40 GB device! I’m now thinking that my “successful” test could have used SGD (I had that option as well) and that the notebook history didn’t record it for some reason.

I had another, similar function that uses far fewer classes, and it does work with L-BFGS, but I broke it around the same time since it only just barely fits in memory. I think that confused me a bit, as the issue with that one was that I needed to be more careful about how I created and stored tensors (see the sketch below).
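
To make that last point concrete, this is the kind of mistake I mean (a hypothetical sketch, not the actual function): appending tensors that are still attached to the autograd graph keeps every iteration’s graph and activations alive, whereas detached CPU copies or plain Python numbers do not.

import torch

model = torch.nn.Linear(16, 4)
criterion = torch.nn.CrossEntropyLoss()

losses = []
activations = []

for _ in range(100):
    inputs = torch.randn(8, 16)
    labels = torch.randint(0, 4, (8,))
    output = model(inputs)
    loss = criterion(output, labels)

    # Bad: losses.append(loss) would keep the whole graph for this iteration alive.
    losses.append(loss.item())                  # plain float, no graph
    activations.append(output.detach().cpu())   # detached CPU copy, no graph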