Pin_memory limitation

I can’t pin more than 2 GB at once; to pin more memory I need to break it into multiple chunks.
For example, running this raises an error:

import torch
buffer =  torch.empty(int(3.0*1024**3), dtype=torch.uint8, pin_memory=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

However, pinning a larger amount of memory by breaking it into multiple chunks works fine:

buffer_1 = torch.empty(int(2.0*1024**3), dtype=torch.uint8, pin_memory=True)
buffer_2 = torch.empty(int(2.0*1024**3), dtype=torch.uint8, pin_memory=True)
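
For reference, here is a minimal sketch of that workaround: a small helper (the name and chunk size are my own, not from any library) that allocates pinned chunks of at most 2 GiB each and returns them as a list.

import torch

# Sketch of the chunked workaround: build a "large" pinned buffer out of
# pieces that each stay under the ~2 GiB limit observed above.
CHUNK_BYTES = 2 * 1024**3  # 2 GiB per pinned chunk

def pinned_chunks(total_bytes, chunk_bytes=CHUNK_BYTES):
    chunks = []
    remaining = total_bytes
    while remaining > 0:
        n = min(chunk_bytes, remaining)
        chunks.append(torch.empty(n, dtype=torch.uint8, pin_memory=True))
        remaining -= n
    return chunks

buffers = pinned_chunks(int(6 * 1024**3))  # three 2 GiB pinned chunks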

I don’t know why this is happening; the machine has more than 1 TB of RAM and there is no limit set on the max locked memory. It also seems to be happening to others, see this discussion in the DeepSpeed GitHub repo.
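
For what it’s worth, the locked-memory limit can be checked from Python as well as with ulimit -l; a quick sketch:

import resource

# RLIMIT_MEMLOCK is the per-process cap on locked (pinned) memory.
# RLIM_INFINITY corresponds to ulimit -l reporting "unlimited".
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("soft:", soft, "hard:", hard)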

More system info:
OS: Ubuntu 24.04.1
GPU: NVIDIA H100
Python: 3.12
CUDA: 12.8
PyTorch: 2.7.0
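
For completeness, most of these versions can also be read from the installed PyTorch build itself (python -m torch.utils.collect_env prints a fuller report):

import torch

# Versions and device as reported by the installed PyTorch build.
print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))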

Not reproducible on my system, so I guess your system might disallow these large allocations:

>>> import torch
>>> buffer =  torch.empty(int(3.0*1024**3), dtype=torch.uint8, pin_memory=True)
>>> buffer.shape
torch.Size([3221225472])

I’m seeing the same issue on my end: allocating a single pinned memory buffer larger than ~2 GB fails with "CUDA error: invalid argument", but multiple smaller chunks work fine. This points to a limitation on the size of individual contiguous pinned memory allocations, rather than on the total amount of pinned memory available. It may be due to driver- or hardware-level constraints, especially with the H100 architecture. Since no locked-memory limits are set, the behavior likely stems from how the CUDA backend handles large pinned memory requests.
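
One way to narrow this down might be to bypass PyTorch and call cudaHostAlloc directly through ctypes: if the raw runtime call also fails above ~2 GB, the limit sits in the CUDA runtime/driver rather than in PyTorch. A rough sketch, assuming libcudart can be found by the dynamic loader (the file may be named libcudart.so or libcudart.so.12 depending on the install):

import ctypes

# Load the CUDA runtime; adjust the name/path if the loader cannot find it.
cudart = ctypes.CDLL("libcudart.so.12")

def try_host_alloc(nbytes):
    # cudaHostAlloc(void **pHost, size_t size, unsigned int flags)
    ptr = ctypes.c_void_p()
    err = cudart.cudaHostAlloc(ctypes.byref(ptr), ctypes.c_size_t(nbytes), 0)  # 0 == cudaHostAllocDefault
    if err == 0:  # cudaSuccess
        cudart.cudaFreeHost(ptr)
    return err

for gib in (1, 2, 3, 4):
    print(f"{gib} GiB -> error code {try_host_alloc(gib * 1024**3)}")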

It would be helpful if anyone knows how to check for such constraints (e.g. an actual driver- or hardware-level limit on the size of a single pinned allocation).