Running out of memory regardless of how much GPU memory is allocated to the process

I almost always run out of memory in the first pass of my training loop. From the looks of it, PyTorch allocates as much memory as possible for the model. I’ve tried torch.cuda.set_per_process_memory_fraction() and found that the model can fit into 7 GB or 13 GB of GPU memory, but in both cases there isn’t enough room left for the batches and/or backward(). Is there a way to work around this?

With full memory (16 GB) it dies on backward():

RuntimeError: CUDA out of memory. Tried to allocate 786.00 MiB (GPU 0; 15.90 GiB total capacity; 14.56 GiB already allocated; 161.75 MiB free; 14.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

With partial memory (8 GB) it dies while putting the batch onto the GPU:

RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 15.90 GiB total capacity; 7.81 GiB already allocated; 6.93 GiB free; 7.95 GiB allowed; 7.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
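(For reference, the max_split_size_mb hint from the message is set through the PYTORCH_CUDA_ALLOC_CONF environment variable, which has to be in place before the first CUDA allocation. A minimal sketch, assuming it goes at the very top of the training script; the 128 is just an example value:)

import os

# must be set before the first CUDA allocation so the caching
# allocator picks it up; 128 MB is an arbitrary example value
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch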

Here's my Dataset:

import random

import torch
from torch.utils.data import Dataset


class Pg19_Dataset(Dataset):
    def __init__(self, data):  # path=None
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        length = 1024

        tokenized = torch.squeeze(self.data[idx])
        label_idx = random.randrange(4, len(tokenized))

        item = {}
        # take up to `length` tokens ending at the randomly chosen label position
        tokens = tokenized[max(label_idx - length, 0): label_idx]

        # right-pad with zeros to a fixed length of 1024
        item['input_ids'] = torch.zeros((length, ))
        item['input_ids'][: len(tokens)] = tokens
        item['input_ids'] = item['input_ids'].long()
        item['labels'] = torch.clone(item['input_ids'])

        # mask out the padding positions
        item['attention_mask'] = torch.zeros((length, ))
        item['attention_mask'][: len(tokens)] = 1
        item['attention_mask'] = item['attention_mask'].half()

        return item

the DataLoader:

import os

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_size=4,
    shuffle=True,
    num_workers=os.cpu_count(),
)

training loop:

from tqdm import tqdm

model.train()
model.to(device)

optim = torch.optim.AdamW(model.parameters(), lr=5e-5)

# torch.cuda.set_per_process_memory_fraction(0.5)

for epoch in range(10):
    loss_list = []
    with tqdm(train_loader) as t:
        for batch in t:
            optim.zero_grad()
            device_batch = {key: value.to(device) for key, value in batch.items()}
            outputs = model(**device_batch)
            loss = outputs[0]
            loss.backward()
            torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
            optim.step()

            loss_list.append(loss.item())
            # currently allocated memory, converted from bytes to GiB (1 / 2**30 ≈ 9.31e-10)
            total_memory = torch.cuda.memory_allocated(device) * 9.31e-10
            t.set_postfix(loss=sum(loss_list) / len(loss_list), gpu_mem=total_memory)
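(The gpu_mem value above is the allocation at that instant; a small sketch of how the peak, which is what actually triggers the OOM, could be tracked instead, using torch.cuda.max_memory_allocated:)

# peak allocation since startup (or since the last reset), in GiB
peak_gib = torch.cuda.max_memory_allocated(device) / 2**30
print(f"peak allocated: {peak_gib:.2f} GiB")

# detailed breakdown of allocated / reserved / inactive memory
print(torch.cuda.memory_summary(device))

# reset the peak counter, e.g. at the start of each epoch
torch.cuda.reset_peak_memory_stats(device)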

Is there a way to limit how much memory a model uses to leave room for batches and backward passes?

The memory used by the model's parameters may not make up the bulk of memory usage during training, so in general this is a difficult problem to solve without adding complexity such as activation checkpointing or model parallelism across multiple GPUs. The first thing you might try is some form of mixed precision, e.g. running parts of the model in fp16 or bfloat16, which would potentially also improve speed.
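A minimal sketch of what mixed precision might look like here with torch.cuda.amp, assuming the same model, optim, train_loader and device as in your loop and that the loss is still outputs[0] (autocast defaults to fp16 on CUDA; with bfloat16 the GradScaler can usually be dropped):

scaler = torch.cuda.amp.GradScaler()

for batch in train_loader:
    optim.zero_grad()
    device_batch = {key: value.to(device) for key, value in batch.items()}

    # run the forward pass in mixed precision (fp16 by default on CUDA)
    with torch.cuda.amp.autocast():
        outputs = model(**device_batch)
        loss = outputs[0]

    # scale the loss to avoid fp16 gradient underflow
    scaler.scale(loss).backward()

    # unscale before clipping so the clip threshold applies to real gradient values
    scaler.unscale_(optim)
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)

    scaler.step(optim)
    scaler.update()

Activation checkpointing (torch.utils.checkpoint, or gradient_checkpointing_enable() if this is a Hugging Face model) trades extra compute for memory by recomputing intermediate activations in the backward pass, and can be combined with the above.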


Thanks for the response! Between scaling the model down and reducing precision, I've gotten something that works.
