So, I am trying to fine-tune FCoref using the trainer in GitHub - shon-otmazgin/fastcoref
It uses dynamic batching with variable-length batches, and this creates an issue on CUDA: once PyTorch's caching allocator has reserved blocks sized for the earlier batches, a later, larger batch cannot reuse them and triggers an OOM.
So, following the guide here: Performance Tuning Guide — PyTorch Tutorials 1.12.1+cu102 documentation
I added this to my code and call it once before the actual training starts (right after creating the model and moving it to CUDA):
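To make the allocator behaviour concrete, here is a minimal sketch (it only does anything if a CUDA device is present): after a large tensor is freed, allocated memory drops back down, but reserved memory stays high because the allocator keeps the freed blocks cached.

```python
import torch

# Sketch of the caching-allocator behaviour described above.
if torch.cuda.is_available():
    x = torch.empty(256, 1024, 1024, device="cuda")  # ~1 GiB of float32
    print(torch.cuda.memory_allocated())             # ~1 GiB in use
    del x
    print(torch.cuda.memory_allocated())             # back near zero
    print(torch.cuda.memory_reserved())              # still ~1 GiB: blocks stay cached
```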
batch = {
    "input_ids": torch.rand(9, 5, 512),
    "attention_mask": torch.rand(9, 5, 512),
    "gold_clusters": torch.rand(9, 58, 39, 2),
    "leftovers": {
        "input_ids": torch.rand(4),
        "attention_mask": torch.rand(4),
    }
}
# Move everything to the GPU; .to() avoids the copy-construct warning
# that torch.tensor() raises when handed an existing tensor.
batch['input_ids'] = batch['input_ids'].to(self.device)
batch['attention_mask'] = batch['attention_mask'].to(self.device)
batch['gold_clusters'] = batch['gold_clusters'].to(self.device)
if 'leftovers' in batch:
    batch['leftovers']['input_ids'] = batch['leftovers']['input_ids'].to(self.device)
    batch['leftovers']['attention_mask'] = batch['leftovers']['attention_mask'].to(self.device)

self.model.zero_grad()
self.model.train()
with torch.cuda.amp.autocast():
    outputs = self.model(batch, gold_clusters=batch['gold_clusters'],
                         return_all_outputs=False)
loss = outputs[0]  # model outputs are always a tuple in transformers (see docs)
loss.backward()
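As an aside, a transformer's embedding layer indexes with integer token ids, so a warm-up batch built from torch.rand floats would normally not be valid model input. A dtype-correct variant might look like the sketch below; make_warmup_batch is a hypothetical helper, and vocab_size=30522 assumes a BERT-style tokenizer, so adjust both to your model.

```python
import torch

def make_warmup_batch(device, vocab_size=30522, n_docs=9, n_segs=5, seq_len=512):
    # Token ids must be integers (embedding lookup); masks are 0/1.
    return {
        "input_ids": torch.randint(0, vocab_size, (n_docs, n_segs, seq_len), device=device),
        "attention_mask": torch.ones(n_docs, n_segs, seq_len, dtype=torch.long, device=device),
    }

batch = make_warmup_batch("cpu")
```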
At first, this itself caused OOM errors because the dummy batch was too big (I had built each tensor with the largest size that key reaches anywhere in my dataset).
So, instead, I created a batch that matches my biggest real batch (by the sum of tensor sizes):
batch = {
    "input_ids": torch.rand(4, 1, 512),
    "attention_mask": torch.rand(4, 1, 512),
    "gold_clusters": torch.rand(4, 11, 24, 2),
    "leftovers": {
        "input_ids": torch.rand(4, 459),
        "attention_mask": torch.rand(4, 459),
    }
}
Now the warm-up itself runs, but once the actual training starts I hit the same error, even though the first real batch is smaller than the pre-allocation batch:
OutOfMemoryError: CUDA out of memory. Tried to allocate 74.00 MiB (GPU 0; 14.56 GiB total capacity; 13.31 GiB already allocated; 36.44 MiB free; 13.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
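Since the error itself suggests fragmentation ("reserved memory is >> allocated memory"), a quick diagnostic sketch is to dump the allocator's own breakdown right before the failing step; a large "reserved but unallocated" figure points to fragmentation rather than genuinely exhausted VRAM:

```python
import torch

# Print the caching allocator's view of GPU memory (only meaningful
# on a machine with a CUDA device).
if torch.cuda.is_available():
    print(torch.cuda.memory_summary(abbreviated=True))
```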
Other things I tried:
- export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:21'
- Decreasing the batch size, but due to the length variability I keep running into the same issue.
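For completeness, the same allocator setting can be applied from inside Python rather than via the shell, as long as it happens before the process makes its first CUDA allocation (otherwise it is silently ignored). The value 128 below is only illustrative, not a recommendation:

```python
import os

# Must be set before the first CUDA allocation in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import (and any .cuda() calls) only after setting the variable
```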
My machine:
- runs Debian 4.19.260-1 (2022-09-29) x86_64 GNU/Linux
- T4 GPU with 16 GB VRAM
Any ideas?