PyTorch Pre-Allocation to avoid OOM does not work

So, I am trying to fine-tune FCoref using the trainer in GitHub - shon-otmazgin/fastcoref
This trainer uses dynamic batching with variable-length batches, and that creates an issue on CUDA: once PyTorch allocates memory for the first batch, it does not grow that allocation for later, larger batches.

So, following this guide here: Performance Tuning Guide — PyTorch Tutorials 1.12.1+cu102 documentation
I added the snippet below and call it once before the actual training starts (right after creating the model and moving it to CUDA):

batch = {
    "input_ids": torch.rand(9, 5, 512),
    "attention_mask": torch.rand(9, 5, 512),
    "gold_clusters": torch.rand(9, 58, 39, 2),
    "leftovers": {
        "input_ids": torch.rand(4),
        "attention_mask": torch.rand(4),
    },
}

batch['input_ids'] = torch.tensor(batch['input_ids'], device=self.device)
batch['attention_mask'] = torch.tensor(batch['attention_mask'], device=self.device)
batch['gold_clusters'] = torch.tensor(batch['gold_clusters'], device=self.device)
if 'leftovers' in batch:
    batch['leftovers']['input_ids'] = torch.tensor(
        batch['leftovers']['input_ids'], device=self.device)
    batch['leftovers']['attention_mask'] = torch.tensor(
        batch['leftovers']['attention_mask'], device=self.device)

with torch.cuda.amp.autocast():
    outputs = self.model(batch, gold_clusters=batch['gold_clusters'])

loss = outputs[0]  # model outputs are always tuple in transformers (see doc)
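For what it's worth, the recipe in the Performance Tuning Guide also runs a backward pass on the oversized dummy batch and then zeroes the gradients, so that gradient and autograd buffers get pre-allocated too, not just the forward activations. A toy, self-contained sketch of that pattern (a plain nn.Linear stand-in, not the fastcoref model):

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 512).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Warm-up pass with the largest expected input, so the caching allocator
# reserves blocks big enough for every later (smaller) batch.
dummy = torch.rand(9, 512, device=device)
loss = model(dummy).sum()
loss.backward()
optimizer.zero_grad(set_to_none=True)
```

The key point is that the dummy step exercises the same allocation pattern (forward, backward, optimizer state) as a real training step, just at maximum size.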

At first, I was getting OOM with this batch because it was too big (I basically took the biggest tensor for each key across my whole dataset).
So, instead, I created a batch that matches my biggest actual batch (by the sum of tensor sizes):

batch = {
    "input_ids": torch.rand(4, 1, 512),
    "attention_mask": torch.rand(4, 1, 512),
    "gold_clusters": torch.rand(4, 11, 24, 2),
    "leftovers": {
        "input_ids": torch.rand(4, 459),
        "attention_mask": torch.rand(4, 459),
    },
}
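As a sanity check, the tensors in this pre-allocation batch are tiny on their own (float32, 4 bytes per element), so most of the 13+ GiB in the error below has to come from the model's activations and the allocator's cached blocks, not from the batch payload itself. The shapes here are just the ones from the snippet above:

```python
# Element counts taken from the pre-allocation batch shapes above (float32 = 4 bytes)
shapes = {
    "input_ids": (4, 1, 512),
    "attention_mask": (4, 1, 512),
    "gold_clusters": (4, 11, 24, 2),
    "leftovers/input_ids": (4, 459),
    "leftovers/attention_mask": (4, 459),
}

def nbytes(shape, itemsize=4):
    n = 1
    for d in shape:
        n *= d
    return n * itemsize

total = sum(nbytes(s) for s in shapes.values())
print(f"batch payload: {total} bytes (~{total / 2**10:.1f} KiB)")  # ~38.6 KiB
```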

Now, this runs, but when the actual training starts I hit the same error, even though the first real batch is smaller than the pre-allocation batch:
OutOfMemoryError: CUDA out of memory. Tried to allocate 74.00 MiB (GPU 0; 14.56 GiB total capacity; 13.31 GiB already allocated; 36.44 MiB free; 13.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
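The "allocated" vs. "reserved" numbers in that error come from PyTorch's caching allocator, and you can log them yourself between batches to see where the memory actually goes. A minimal helper, just wrapping torch.cuda.memory_allocated / memory_reserved (it degrades gracefully on CPU-only machines):

```python
import torch

def log_cuda_memory(tag: str) -> None:
    """Print the caching allocator's current state, in MiB."""
    if not torch.cuda.is_available():
        print(f"[{tag}] CUDA not available")
        return
    allocated = torch.cuda.memory_allocated() / 2**20  # bytes held by live tensors
    reserved = torch.cuda.memory_reserved() / 2**20    # bytes cached by the allocator
    print(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")

log_cuda_memory("before batch")
```

torch.cuda.memory_summary() gives a much more detailed per-pool breakdown if the two totals aren't enough.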

Other things I tried:

  1. export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:21'
  2. Decreasing batch size, but due to the variability I keep running into the same issue.
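On point 1: PYTORCH_CUDA_ALLOC_CONF is only read when the CUDA caching allocator initializes, so it must be set before the first CUDA allocation (i.e. before the model is moved to the device). If you set it from Python instead of the shell, do it early; the 128 below is just an illustrative value, not a recommendation:

```python
import os

# Must be set before the first CUDA allocation (e.g. before model.to("cuda")),
# otherwise the allocator has already been configured and the value is ignored.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```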

My machine:

  • runs Debian 4.19.260-1 (2022-09-29) x86_64 GNU/Linux
  • T4 GPU with 16 GB VRAM

Any idea?

Could you install the latest stable or nightly release with the CUDA 11.7 runtime? We've enabled lazy module loading there, which should reduce the memory footprint significantly (assuming you are using an older release).

I am using CUDA 11.6, but is lazy loading going to solve this?
It is not so much a memory-footprint issue as that the first memory allocation sticks and never changes for later, variable-size batches.