How to load extremely large tensors into memory

I have an extremely large tensor on my disk whose dimensions are [3000000, 128, 768]. It cannot be loaded entirely into memory at once because the code crashes. Is there a way to load only specific rows, based on provided indices, using the current DataLoader? For example, loading only rows [1, 55555, 2673] from that large tensor on disk into memory? Thanks!

You could use mmap in torch.load as described in the docs:

mmap (Optional[bool]) – Indicates whether the file should be mmaped rather than loading all the storages into memory. Typically, tensor storages in the file will first be moved from disk to CPU memory, after which they are moved to the location that they were tagged with when saving, or specified by map_location. This second step is a no-op if the final location is CPU. When the mmap flag is set, instead of copying the tensor storages from disk to CPU memory in the first step, f is mmaped.

Hello ptrblck,

Thanks for your reply. I actually tried to use mmap directly, as follows:

import torch
import mmap
import numpy as np

tensor_size = 1000
tensor = torch.randn(1000, 128, 768)

print("tensor", tensor[torch.tensor([166, 896, 66, 888, 1])])

# Path to the file containing the tensor
file_path = '/'

# Specify the indices of rows to select
indices = [166, 896, 66, 888, 1]

# Define the tensor shape
num_rows = len(indices)
row_size = 128*768  # Assuming each row has a shape of [128, 768]

# Load specific rows using mmap
selected_rows = []
with open(file_path, 'r+b') as file:
    # Memory map the file
    mmapped_file = mmap.mmap(file.fileno(), length=0, access=mmap.ACCESS_READ)

    # Calculate the byte size of each row
    row_byte_size = row_size * np.dtype(np.float32).itemsize

    # Iterate through indices and read specific rows
    for idx in indices:
        # Seek to the byte offset of this row and read its raw bytes
        mmapped_file.seek(idx * row_byte_size)
        row_data = np.frombuffer(mmapped_file.read(row_byte_size), dtype=np.float32)
        row_tensor = torch.tensor(row_data, dtype=torch.float32).view(128, 768)
        selected_rows.append(row_tensor)

# Concatenate selected rows into a new tensor
concatenated_tensor = torch.stack(selected_rows, dim=0)

# Confirm the shape of the new tensor
print(concatenated_tensor)  # This should have shape [5, 128, 768]

However, the two printed results are not the same. Could I have some further advice on this issue? Thanks again.

I don’t think you can directly use mmap.mmap on a file written by torch.save, as the saved file is an archive containing metadata as well as the actual tensor data, while your code expects to read the raw binary data starting at offset 0.
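If you want raw-offset access like in your snippet, one option (a sketch, not related to the torch.save format; the file name is an assumption) is to dump the tensor as a flat binary file yourself and memory-map it with numpy:

```python
import numpy as np
import torch

tensor = torch.randn(1000, 128, 768)
tensor.numpy().tofile("tensor.bin")  # raw float32 data, no metadata

# np.memmap only touches the pages that are actually indexed
mm = np.memmap("tensor.bin", dtype=np.float32, mode="r", shape=(1000, 128, 768))

indices = [166, 896, 66, 888, 1]
selected = torch.from_numpy(np.array(mm[indices]))  # copies just these rows

print(torch.equal(selected, tensor[indices]))  # True
```

Since the file now contains nothing but contiguous float32 values, row i really does start at byte offset i * 128 * 768 * 4, which is what your manual seek arithmetic assumes.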