I almost always run out of memory in the first pass of my training loop. From the looks of it, PyTorch allocates as much memory as possible for the model. I've tried torch.cuda.set_per_process_memory_fraction() and found that the model fits into 7 GB or 13 GB of GPU memory, but in both cases that doesn't leave enough room for the batches and/or backward(). Is there a way to work around this?
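As a rough sanity check on those numbers: fitting the weights is only part of the budget. With fp32 training, gradients take as much memory as the weights, and AdamW keeps two extra state tensors per parameter, so the steady-state cost is about 16 bytes per parameter before any activations. The sketch below assumes a hypothetical 1.5 billion parameter model (the real parameter count is not stated in this post) and fp32 everywhere; `training_memory_gib` is an illustrative helper, not a PyTorch function.

```python
GIB = 2 ** 30  # bytes per GiB


def training_memory_gib(n_params: int, bytes_per_value: int = 4) -> dict:
    """Estimate weights + gradients + AdamW state, ignoring activations."""
    weights = n_params * bytes_per_value
    grads = weights              # one gradient value per parameter
    adam_states = 2 * weights    # AdamW keeps exp_avg and exp_avg_sq
    total = weights + grads + adam_states
    return {
        "weights": weights / GIB,
        "grads": grads / GIB,
        "adam_states": adam_states / GIB,
        "total": total / GIB,
    }


# ASSUMPTION: 1.5e9 parameters is a made-up figure for illustration only.
estimate = training_memory_gib(1_500_000_000)
for name, gib in estimate.items():
    print(f"{name:>12}: {gib:6.2f} GiB")
```

If an estimate like this already exceeds the card's capacity, lowering the ceiling with set_per_process_memory_fraction() presumably cannot help, since it only restricts how much the allocator may take.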
With full memory (16 GB) it dies on backward():
RuntimeError: CUDA out of memory. Tried to allocate 786.00 MiB (GPU 0; 15.90 GiB total capacity; 14.56 GiB already allocated; 161.75 MiB free; 14.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
With partial memory (8 GB) it dies while putting the batch onto the GPU:
RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 15.90 GiB total capacity; 7.81 GiB already allocated; 6.93 GiB free; 7.95 GiB allowed; 7.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
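Both messages suggest max_split_size_mb when reserved memory is much larger than allocated memory. A quick way to check whether that applies is to pull the figures out of the message and look at the gap. The sketch below is plain Python string parsing; `parse_oom` is a hypothetical helper and the regular expressions only assume the message wording shown above.

```python
import re


def parse_oom(message: str) -> dict:
    """Extract the GiB figures from a CUDA OOM message like the ones above."""
    patterns = {
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "reserved": r"([\d.]+) GiB reserved in total",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            stats[name] = float(match.group(1))
    # Memory the allocator holds but has not handed out to tensors.
    if "reserved" in stats and "allocated" in stats:
        stats["reserved_unused"] = round(stats["reserved"] - stats["allocated"], 2)
    return stats


full_run = (
    "Tried to allocate 786.00 MiB (GPU 0; 15.90 GiB total capacity; "
    "14.56 GiB already allocated; 161.75 MiB free; "
    "14.64 GiB reserved in total by PyTorch)"
)
half_run = (
    "Tried to allocate 192.00 MiB (GPU 0; 15.90 GiB total capacity; "
    "7.81 GiB already allocated; 6.93 GiB free; 7.95 GiB allowed; "
    "7.86 GiB reserved in total by PyTorch)"
)

full_stats = parse_oom(full_run)
half_stats = parse_oom(half_run)
print(full_stats["reserved_unused"], half_stats["reserved_unused"])
```

In both runs the reserved-but-unused gap is under 0.1 GiB, so fragmentation tuning looks unlikely to be the fix here; nearly everything reserved is actually in use by tensors.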
Here's my Dataset:
class Pg19_Dataset(Dataset):
    def __init__(self, data):  # path=None):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        length = 1024
        tokenized = torch.squeeze(self.data[idx])
        label_idx = random.randrange(4, len(tokenized))
        item = {}
        tokens = tokenized[max(label_idx - length, 0): label_idx]
        item['input_ids'] = torch.zeros((length, ))
        item['input_ids'][: len(tokens)] = tokens
        item['input_ids'] = item['input_ids'].long()
        item['labels'] = torch.clone(item['input_ids'])
        item['attention_mask'] = torch.zeros((length, ))
        item['attention_mask'][: len(tokens)] = 1
        item['attention_mask'] = item['attention_mask'].half()
        return item
The DataLoader:
train_loader = DataLoader(
    train_dataset,
    batch_size=4,
    shuffle=True,
    num_workers=os.cpu_count()
)
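For scale, the size of one collated batch can be worked out from the Dataset and DataLoader settings above: 4 samples of 1024 positions, with input_ids and labels as int64 (8 bytes each) and attention_mask as float16 (2 bytes). The sketch below is only that arithmetic; the constant names are made up for illustration.

```python
SEQ_LEN = 1024      # `length` in __getitem__
BATCH_SIZE = 4      # batch_size passed to the DataLoader

# Bytes per element for each tensor returned by __getitem__.
BYTES_PER_ELEMENT = {
    "input_ids": 8,       # .long()  -> int64
    "labels": 8,          # clone of input_ids -> int64
    "attention_mask": 2,  # .half()  -> float16
}

batch_bytes = sum(BATCH_SIZE * SEQ_LEN * size for size in BYTES_PER_ELEMENT.values())
batch_kib = batch_bytes / 1024

print(f"one batch on the GPU: {batch_bytes} bytes ({batch_kib:.1f} KiB)")
```

At tens of KiB per batch, the input tensors themselves are tiny next to the 192 MiB and 786 MiB allocations in the error messages, which hints that the failing allocations come from inside the model's forward or backward pass rather than from the batch copy.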
The training loop:
model.train()
model.to(device)
optim = torch.optim.AdamW(model.parameters(), lr=5e-5)
# torch.cuda.set_per_process_memory_fraction(0.5)
for epoch in range(10):
    loss_list = []
    with tqdm(train_loader) as t:
        for batch in t:
            optim.zero_grad()
            device_batch = {key: value.to(device) for key, value in batch.items()}
            outputs = model(**device_batch)
            loss = outputs[0]
            loss.backward()
            torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
            optim.step()
            loss_list.append(loss.item())
            # 9.31e-10 is roughly 1 / 2**30, i.e. bytes -> GiB
            total_memory = torch.cuda.memory_allocated(device) * 9.31e-10
            t.set_postfix(loss=sum(loss_list)/len(loss_list), gpu_mem=total_memory)
Is there a way to limit how much memory a model uses to leave room for batches and backward passes?
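One commonly suggested direction, offered here only as a sketch, is to shrink the per-step activation footprint instead of capping the model: run micro-batches of size 1 and accumulate gradients over 4 of them, which approximates the batch_size=4 update above. The example uses a tiny nn.Linear and random data as hypothetical stand-ins (not the model or dataset from this post) so that it runs on CPU; `accum_steps` and `steps_taken` are illustrative names. Note that this reduces activation memory only; weights, gradients, and AdamW state stay the same size.

```python
import torch
from torch import nn

torch.manual_seed(0)

# ASSUMPTION: stand-in model and data, chosen only so the sketch runs anywhere.
model = nn.Linear(8, 1)
optim = torch.optim.AdamW(model.parameters(), lr=5e-5)
data = [(torch.randn(1, 8), torch.randn(1, 1)) for _ in range(8)]

accum_steps = 4   # 4 micro-batches of size 1 per optimizer step
steps_taken = 0

model.train()
optim.zero_grad()
for i, (x, y) in enumerate(data, start=1):
    loss = nn.functional.mse_loss(model(x), y)
    # Scale so the accumulated gradient matches the mean over the full batch.
    (loss / accum_steps).backward()
    if i % accum_steps == 0:
        torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
        optim.step()
        optim.zero_grad()
        steps_taken += 1

print(f"optimizer steps taken: {steps_taken}")
```

In the real loop this would correspond to setting batch_size=1 in the DataLoader and stepping the optimizer every fourth batch.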