XPU out of memory error with Intel Arc Graphics (Meteor Lake) despite sufficient system memory and reported XPU capacity

Issue Description:
I’m experiencing an XPU out of memory error when fine-tuning an LLM with Axolotl and PyTorch 2.7.0+xpu on a ThinkPad T14p Gen2 running Manjaro Linux.

PyTorch reports that the GPU has 28.56 GiB total capacity, but fails to allocate 7.09 GiB, even though system memory is plentiful and none is used by PyTorch at the time.

There is no BIOS option to change the integrated GPU memory allocation, and lspci only shows a 256 MB prefetchable memory region, which may correspond to the visible VRAM window. Either way, it does not match the memory PyTorch claims is available.
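
For reference, a minimal sketch to print what PyTorch itself reports for the XPU device, so it can be compared against the lspci window:

import torch

# Print what PyTorch reports for the integrated XPU.  The reported
# total_memory appears to reflect shared system memory rather than the
# 256 MB prefetchable BAR that lspci shows.
if torch.xpu.is_available():
    device = torch.device("xpu")
    props = torch.xpu.get_device_properties(device)
    print(f"Device Name: {torch.xpu.get_device_name(device)}")
    print(f"Total Memory: {props.total_memory / 1024**3:.2f} GiB")
    print(f"Currently Allocated: {torch.xpu.memory_allocated(device) / 1024**3:.2f} GiB")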


Hardware & Environment Details:

  • Device: ThinkPad T14p Gen2

  • CPU: Intel Core Ultra 5 125H

  • GPU: Intel Arc Graphics (Meteor Lake-P)

  • Driver: i915 (and xe?)

  • Kernel: 6.12.25-1-MANJARO

  • PyTorch Version: 2.7.0+xpu

  • Axolotl Version: 0.10.0.dev0

  • System RAM: 30 GiB (No Swap)

  • Memory Free at Crash: 25 GiB

  • lspci -v | grep -A 8 VGA:

    00:02.0 VGA compatible controller: Intel Corporation Meteor Lake-P [Intel Arc Graphics] (rev 08) (prog-if 00 [VGA controller])
      Subsystem: Lenovo Device 50e9
      Flags: bus master, fast devsel, latency 0, IRQ 157
      Memory at 4058000000 (64-bit, prefetchable) [size=16M]
      Memory at 4000000000 (64-bit, prefetchable) [size=256M]
      Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
      Capabilities: <access denied>
      Kernel driver in use: i915
      Kernel modules: i915, xe
    

Error Trace:

Loading checkpoint shards:   0%|                                                                                                                                                             | 0/4 [00:00<?, ?it/s][2025-05-06 18:36:50,956] [ERROR] [axolotl.utils.models.load_model:1273] [PID:6160] [RANK:0] XPU out of memory. Tried to allocate 7.09 GiB. GPU 0 has a total capacity of 28.56 GiB. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. Please use `empty_cache` to release all unoccupied cached memory.
Traceback (most recent call last):
  File "/home/libchara/finetune_intel/axolotl/src/axolotl/utils/models.py", line 1270, in load_model
    skip_move_to_device = self.build_model(qlora_fsdp)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/libchara/finetune_intel/axolotl/src/axolotl/utils/models.py", line 1128, in build_model
    self.model = self.auto_model_loader.from_pretrained(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/libchara/finetune_intel/finetune/lib/python3.12/site-packages/transformers/models/auto/auto_factory.py", line 571, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/libchara/finetune_intel/finetune/lib/python3.12/site-packages/transformers/modeling_utils.py", line 279, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/libchara/finetune_intel/finetune/lib/python3.12/site-packages/transformers/modeling_utils.py", line 4399, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/libchara/finetune_intel/finetune/lib/python3.12/site-packages/transformers/modeling_utils.py", line 4793, in _load_pretrained_model
    caching_allocator_warmup(model_to_load, expanded_device_map, factor=2 if hf_quantizer is None else 4)
  File "/home/libchara/finetune_intel/finetune/lib/python3.12/site-packages/transformers/modeling_utils.py", line 5803, in caching_allocator_warmup
    _ = torch.empty(byte_count // factor, dtype=torch.float16, device=device, requires_grad=False)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: XPU out of memory. Tried to allocate 7.09 GiB. GPU 0 has a total capacity of 28.56 GiB. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. Please use `empty_cache` to release all unoccupied cached memory.

Additional Notes:

  • BIOS has no configurable setting for iGPU memory or ReBAR.
  • Intel Arc Graphics is iGPU; there is no discrete GPU.
  • According to Intel tools and forums, the iGPU uses system memory via dynamic allocation (DVMT), but the behavior here seems inconsistent with the memory PyTorch reports.

Expected Behavior:

  • If PyTorch reports 28 GiB of total capacity, it should not fail to allocate 7 GiB.
  • Otherwise, it should provide better diagnostics or a warning when host-side DVMT or BAR settings limit the actually usable memory.

If you need any additional details, please let me know.

Hi libchara!

I have a system similar to yours (meteor-lake thinkpad with 32 GB system ram) and have
worked with xpu tensors in excess of 10GB using, for example, 2.7.0+xpu (downloaded
as RC1).

I do believe you’re correct that the “xpu” uses system ram (i.e., cpu ram) for its memory.
It looks odd to me that, if your system has 30 GB, pytorch would report 28 GB as
available. I would expect your os and especially your windowing system (which I assume
you run) to take up several (i.e., more than two) GB before you even start pytorch.

Does your system have a typical linux “system monitor” that shows you how much memory
is already in use before you start python / pytorch?

You might try starting a fresh pytorch session and see how large a tensor you can create
directly on the xpu before you get an out-of-memory error.
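
For example, something along these lines (just a sketch; adjust the size to whatever you
want to test) will tell you whether a single large allocation succeeds on the xpu:

import torch

# Try one large allocation directly on the xpu (here ~8 GB worth of float32).
n_bytes = 8 * 1024**3
t = torch.empty(n_bytes // 4, dtype=torch.float32, device="xpu")
torch.xpu.synchronize()
print(f"allocated: {torch.xpu.memory_allocated() / 1024**3:.2f} GB")
del t
torch.xpu.empty_cache()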

(Also, although I would expect it to be smart enough not to do so, is it possible that your
model-loading scheme first loads – or decompresses – parts of the model into memory
under control of the cpu and then moves those tensors to memory under control of the
xpu? If so, you could have two copies of some of your model parameters in system
memory during the loading process.)
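
As a toy illustration of that effect (a sketch only; the tensor size here is made up), you can
watch free while a cpu tensor is moved to the xpu; both copies stay alive until the cpu copy
is released:

import os
import torch

# Create a ~2 GB float32 tensor on the cpu first...
cpu_t = torch.empty(512 * 1024**2, dtype=torch.float32)
os.system("free -h")          # host ram now holds one copy

# ...then move it to the xpu.  Until cpu_t is deleted, the data exists twice:
# once in the cpu copy and once in the xpu copy (which, on an iGPU, is also
# backed by system ram).
xpu_t = cpu_t.to("xpu")
torch.xpu.synchronize()
os.system("free -h")          # roughly double the footprint

del cpu_t                     # release the host copy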

I don’t know how pytorch goes about determining the amount of memory available to it,
especially when using the xpu. Note that intel arc xpu support is a work in progress and
not fully baked, so I would suggest you test pytorch’s reported available memory by
trying to create some xpu tensors of known size.

Best.

K. Frank


I wrote a little script for testing XPU memory allocation, and now I realize that it fails to allocate 4 GB every time.

[libchara@libchara-ThinkPad-T14p-Gen-2 ~]$ cd finetune_intel/
[libchara@libchara-ThinkPad-T14p-Gen-2 finetune_intel]$ source finetune/bin/activate                 
(finetune) [libchara@libchara-ThinkPad-T14p-Gen-2 finetune_intel]$ source /opt/intel/oneapi/setvars.sh 
 
:: initializing oneAPI environment ...
   bash: BASH_VERSION = 5.2.37(1)-release
   args: Using "$@" for setvars.sh arguments: 
:: ccl -- latest
:: compiler -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpl -- latest
:: mkl -- latest
:: mpi -- latest
:: pti -- latest
:: tbb -- latest
:: umf -- latest
:: oneAPI environment initialized ::
 
(finetune) [libchara@libchara-ThinkPad-T14p-Gen-2 finetune_intel]$ python3 xpu_mem_test.py 

=== XPU Memory Test ===
Device Name: Intel(R) Arc(TM) Graphics

Device Memory: 14.06GB total
Max allocated during session: 0.00GB

Attempting to allocate 1.0GB tensor...
Success! Current allocated: 1.00GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       4.8Gi        23Gi       460Mi       3.2Gi        26Gi
Swap:             0B          0B          0B

Attempting to allocate 1.1GB tensor...
Success! Current allocated: 1.10GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       4.9Gi        23Gi       460Mi       3.2Gi        25Gi
Swap:             0B          0B          0B

Attempting to allocate 1.2000000000000002GB tensor...
Success! Current allocated: 1.20GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       6.1Gi        22Gi       460Mi       3.2Gi        24Gi
Swap:             0B          0B          0B

Attempting to allocate 1.3000000000000003GB tensor...
Success! Current allocated: 1.30GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       6.1Gi        22Gi       460Mi       3.2Gi        24Gi
Swap:             0B          0B          0B

Attempting to allocate 1.4000000000000004GB tensor...
Success! Current allocated: 1.40GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       6.1Gi        22Gi       460Mi       3.2Gi        24Gi
Swap:             0B          0B          0B

Attempting to allocate 1.5000000000000004GB tensor...
Success! Current allocated: 1.50GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       6.1Gi        22Gi       460Mi       3.2Gi        24Gi
Swap:             0B          0B          0B

Attempting to allocate 1.6000000000000005GB tensor...
Success! Current allocated: 1.60GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       6.9Gi        21Gi       460Mi       3.2Gi        23Gi
Swap:             0B          0B          0B

Attempting to allocate 1.7000000000000006GB tensor...
Success! Current allocated: 1.70GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       6.9Gi        21Gi       460Mi       3.2Gi        23Gi
Swap:             0B          0B          0B

Attempting to allocate 1.8000000000000007GB tensor...
Success! Current allocated: 1.80GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       7.3Gi        21Gi       460Mi       3.2Gi        23Gi
Swap:             0B          0B          0B

Attempting to allocate 1.9000000000000008GB tensor...
Success! Current allocated: 1.90GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       7.3Gi        21Gi       460Mi       3.2Gi        23Gi
Swap:             0B          0B          0B

Attempting to allocate 2.000000000000001GB tensor...
Success! Current allocated: 2.00GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       7.3Gi        21Gi       460Mi       3.2Gi        23Gi
Swap:             0B          0B          0B

Attempting to allocate 2.100000000000001GB tensor...
Success! Current allocated: 2.10GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       7.3Gi        21Gi       460Mi       3.2Gi        23Gi
Swap:             0B          0B          0B

Attempting to allocate 2.200000000000001GB tensor...
Success! Current allocated: 2.20GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       7.3Gi        21Gi       460Mi       3.2Gi        23Gi
Swap:             0B          0B          0B

Attempting to allocate 2.300000000000001GB tensor...
Success! Current allocated: 2.30GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       8.3Gi        20Gi       460Mi       3.2Gi        22Gi
Swap:             0B          0B          0B

Attempting to allocate 2.4000000000000012GB tensor...
Success! Current allocated: 2.40GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       8.3Gi        20Gi       460Mi       3.2Gi        22Gi
Swap:             0B          0B          0B

Attempting to allocate 2.5000000000000013GB tensor...
Success! Current allocated: 2.50GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       8.3Gi        20Gi       460Mi       3.2Gi        22Gi
Swap:             0B          0B          0B

Attempting to allocate 2.6000000000000014GB tensor...
Success! Current allocated: 2.60GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       8.3Gi        20Gi       460Mi       3.2Gi        22Gi
Swap:             0B          0B          0B

Attempting to allocate 2.7000000000000015GB tensor...
Success! Current allocated: 2.70GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       9.1Gi        19Gi       460Mi       3.2Gi        21Gi
Swap:             0B          0B          0B

Attempting to allocate 2.8000000000000016GB tensor...
Success! Current allocated: 2.80GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       9.3Gi        19Gi       460Mi       3.2Gi        21Gi
Swap:             0B          0B          0B

Attempting to allocate 2.9000000000000017GB tensor...
Success! Current allocated: 2.90GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       9.3Gi        19Gi       460Mi       3.2Gi        21Gi
Swap:             0B          0B          0B

Attempting to allocate 3.0000000000000018GB tensor...
Success! Current allocated: 3.00GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       9.3Gi        19Gi       460Mi       3.2Gi        21Gi
Swap:             0B          0B          0B

Attempting to allocate 3.100000000000002GB tensor...
Success! Current allocated: 3.10GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       9.3Gi        19Gi       460Mi       3.2Gi        21Gi
Swap:             0B          0B          0B

Attempting to allocate 3.200000000000002GB tensor...
Success! Current allocated: 3.20GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       9.3Gi        19Gi       460Mi       3.2Gi        21Gi
Swap:             0B          0B          0B

Attempting to allocate 3.300000000000002GB tensor...
Success! Current allocated: 3.30GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       9.3Gi        19Gi       460Mi       3.2Gi        21Gi
Swap:             0B          0B          0B

Attempting to allocate 3.400000000000002GB tensor...
Success! Current allocated: 3.40GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       9.3Gi        19Gi       460Mi       3.2Gi        21Gi
Swap:             0B          0B          0B

Attempting to allocate 3.500000000000002GB tensor...
Success! Current allocated: 3.50GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       9.4Gi        19Gi       460Mi       3.2Gi        21Gi
Swap:             0B          0B          0B

Attempting to allocate 3.6000000000000023GB tensor...
Success! Current allocated: 3.60GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       9.4Gi        19Gi       460Mi       3.2Gi        21Gi
Swap:             0B          0B          0B

Attempting to allocate 3.7000000000000024GB tensor...
Success! Current allocated: 3.70GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       9.4Gi        19Gi       460Mi       3.2Gi        21Gi
Swap:             0B          0B          0B

Attempting to allocate 3.8000000000000025GB tensor...
Success! Current allocated: 3.80GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       9.4Gi        19Gi       460Mi       3.2Gi        21Gi
Swap:             0B          0B          0B

Attempting to allocate 3.9000000000000026GB tensor...
Success! Current allocated: 3.90GB
               total        used        free      shared  buff/cache   available
Mem:            30Gi       9.4Gi        19Gi       460Mi       3.2Gi        21Gi
Swap:             0B          0B          0B

Attempting to allocate 4.000000000000003GB tensor...

Allocation failed at 4.000000000000003GB (last success: 3.9000000000000026GB)
Error message: XPU out of memory. Tried to allocate 4.00 GiB. GPU 0 has a total capacity of 14.06 GiB. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. Please use `empty_cache` to release all unoccupied cached memory.
               total        used        free      shared  buff/cache   available
Mem:            30Gi       9.4Gi        19Gi       460Mi       3.2Gi        21Gi
Swap:             0B          0B          0B

! Final allocated memory: 0.00GB
Test completed.
(finetune) [libchara@libchara-ThinkPad-T14p-Gen-2 finetune_intel]$ 

Here’s the script:

import torch
from torch import xpu
import os

def xpu_memory_test():
    if not xpu.is_available():
        print("XPU not available!")
        return

    device = torch.device("xpu")
    print(f"\n=== XPU Memory Test ===")

    try:
        print(f"Device Name: {torch.xpu.get_device_name(device)}")

        max_alloc = torch.xpu.max_memory_allocated(device) / (1024**3)
        total_mem = torch.xpu.get_device_properties(device).total_memory / (1024**3)

        print(f"\nDevice Memory: {total_mem:.2f}GB total")
        print(f"Max allocated during session: {max_alloc:.2f}GB")

        # Grow the requested size in 0.1 GB steps, starting at 1 GB, until an
        # allocation fails or the reported total capacity is reached.  (The
        # float step accumulates rounding error, hence sizes like
        # 1.2000000000000002 in the output.)
        size_step = 0.1
        current_size = 1.0
        last_success = 0

        while current_size <= total_mem:
            # Number of float32 elements (4 bytes each) for current_size GB.
            tensor_size = int(current_size * (1024**3 / 4))
            print(f"\nAttempting to allocate {current_size}GB tensor...")

            try:
                test_tensor = torch.empty(tensor_size, dtype=torch.float32, device=device)
                torch.xpu.synchronize(device)

                allocated = torch.xpu.memory_allocated(device) / (1024**3)
                print(f"Success! Current allocated: {allocated:.2f}GB")
                os.system("free -h")  # show host-side memory after each allocation

                # Free the tensor before trying the next, larger size.
                del test_tensor
                torch.xpu.empty_cache()
                last_success = current_size
                current_size += size_step

            except RuntimeError as e:
                print(f"\nAllocation failed at {current_size}GB (last success: {last_success}GB)")
                print(f"Error message: {str(e)}")
                os.system("free -h")
                break

    except Exception as e:
        print(f"\nError during memory test: {str(e)}")

    finally:
        allocated = torch.xpu.memory_allocated(device) / (1024**3)
        print(f"\n! Final allocated memory: {allocated:.2f}GB")
        print("Test completed.")

if __name__ == "__main__":
    xpu_memory_test()
    torch.xpu.empty_cache()
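
Since the failure happens right at 4 GB even though smaller allocations add up past that, a quick follow-up check (a separate sketch, not part of the script above) would be to hold several ~3 GB tensors at once; if that works while a single 4 GB tensor fails, the limit looks like a per-allocation cap rather than a total-capacity one:

import torch

# Hold three ~3 GB float32 tensors simultaneously (~9 GB total) on the xpu.
# If this succeeds while a single 4 GB allocation fails, the limit appears to
# be on individual allocations rather than on total device memory.
tensors = []
for i in range(3):
    n_elems = int(3 * 1024**3 / 4)   # ~3 GB worth of float32 elements
    tensors.append(torch.empty(n_elems, dtype=torch.float32, device="xpu"))
    torch.xpu.synchronize()
    print(f"held {len(tensors)} tensors, "
          f"allocated: {torch.xpu.memory_allocated() / 1024**3:.2f} GB")

del tensors
torch.xpu.empty_cache()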

I’ve got the same issue on my laptop, an Asus Vivobook S 15:
CPU: Intel Core Ultra 7 155H
RAM: 32 GB
GPU: Intel Arc Graphics
OS: Windows 11

Allocation failed at 4.000000000000003GB (last success: 3.9000000000000026GB)
Error message: XPU out of memory. Tried to allocate 4.00 GiB. GPU 0 has a total capacity of 16.44 GiB. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. Please use empty_cache to release all unoccupied cached memory.
‘free’ is not recognized as an internal or external command,
operable program or batch file.

! Final allocated memory: 0.00GB
Test completed.