How to load extremely large tensors into memory

I have an extremely large tensor on my disk whose dimensions are [3000000, 128, 768]. It cannot be loaded entirely into memory at once because the code crashes. Is there a way to load only specific rows, based on provided indices, using the current DataLoader? For example, loading only rows [1, 55555, 2673] from that large tensor on disk into memory? Thanks!

You could use mmap in torch.load as described in the docs:

mmap (Optional[bool]) – Indicates whether the file should be mmaped rather than loading all the storages into memory. Typically, tensor storages in the file will first be moved from disk to CPU memory, after which they are moved to the location that they were tagged with when saving, or specified by map_location. This second step is a no-op if the final location is CPU. When the mmap flag is set, instead of copying the tensor storages from disk to CPU memory in the first step, f is mmaped.

Hello ptrblck,

Thanks for your reply. I actually tried to use mmap directly, as follows:

import torch
import mmap
import numpy as np

tensor_size = 1000
tensor = torch.randn(1000, 128, 768)

print("tensor", tensor[torch.tensor([166, 896, 66, 888, 1])])

# Path to the file containing the tensor
file_path = '/'

# Specify the indices of rows to select
indices = [166, 896, 66, 888, 1]

# Define the tensor shape
num_rows = len(indices)
row_size = 128*768  # Assuming each row has a shape of [128, 768]

# Load specific rows using mmap
selected_rows = []
with open(file_path, 'r+b') as file:
    # Memory map the file
    mmapped_file = mmap.mmap(file.fileno(), length=0, access=mmap.ACCESS_READ)

    # Calculate the byte size of each row
    row_byte_size = row_size * np.dtype(np.float32).itemsize

    # Iterate through indices and read specific rows
    for idx in indices:
        # Seek to the byte offset of this row and read its raw bytes
        mmapped_file.seek(idx * row_byte_size)
        row_data = np.frombuffer(mmapped_file.read(row_byte_size), dtype=np.float32)
        row_tensor = torch.tensor(row_data, dtype=torch.float32).view(128, 768)
        selected_rows.append(row_tensor)

# Concatenate selected rows into a new tensor
concatenated_tensor = torch.stack(selected_rows, dim=0)

# Confirm the shape of the new tensor
print(concatenated_tensor)  # This should have shape [5, 128, 768]

However, the two printed results are not the same. Could I have some further advice on this issue? Thanks again.

I don’t think you can directly use mmap.mmap on a file written by torch.save, as the saved file is an archive containing metadata as well as the actual tensor data, while your code expects to read the raw binary data starting at offset 0.
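If you want raw-offset access like in your snippet, one option (a sketch, not related to the torch.save format; the file name is an assumption) is to dump the tensor as a flat binary file yourself and memory-map it with numpy:

```python
import numpy as np
import torch

tensor = torch.randn(1000, 128, 768)
tensor.numpy().tofile("tensor.bin")  # raw float32 data, no metadata

# np.memmap only touches the pages that are actually indexed
mm = np.memmap("tensor.bin", dtype=np.float32, mode="r", shape=(1000, 128, 768))

indices = [166, 896, 66, 888, 1]
selected = torch.from_numpy(np.array(mm[indices]))  # copies just these rows

print(torch.equal(selected, tensor[indices]))  # True
```

Since the file now contains nothing but contiguous float32 values, row i really does start at byte offset i * 128 * 768 * 4, which is what your manual seek arithmetic assumes.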