I have an extremely large tensor on disk with dimensions [3000000, 128, 768]; it cannot be loaded entirely into memory at once (the code crashes when I try). Is there a way to load only specific rows based on provided indices using the current DataLoader? For example, only load rows [1, 55555, 2673], three rows from that large tensor on disk, into memory? Thanks!
You could use
torch.load as described in the docs:
mmap (Optional[bool]) – Indicates whether the file should be mmaped rather than loading all the storages into memory. Typically, tensor storages in the file will first be moved from disk to CPU memory, after which they are moved to the location that they were tagged with when saving, or specified by
map_location. This second step is a no-op if the final location is CPU. When the
mmap flag is set, instead of copying the tensor storages from disk to CPU memory in the first step, f is mmaped.
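A minimal sketch of this approach (assuming a reasonably recent PyTorch, since the mmap argument to torch.load was only added in the 2.x releases; the file name and the smaller example shape are placeholders for your real file):

import torch

# Save a small example tensor; on disk this would be your [3000000, 128, 768] file
tensor = torch.randn(100, 128, 768)
torch.save(tensor, 'big.pt')

# mmap=True maps the file instead of reading it all into RAM;
# only the pages touched by the indexing below are actually paged in.
mapped = torch.load('big.pt', mmap=True, map_location='cpu')
rows = mapped[torch.tensor([1, 55, 67])]  # shape [3, 128, 768]

The indexing copies just the selected rows into a regular in-memory tensor, so a Dataset's __getitem__ could index into the mapped tensor the same way.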
Thanks for your reply. I actually tried to use mmap directly as follows:
import torch
import mmap
import numpy as np

tensor_size = 1000
tensor = torch.randn(1000, 128, 768)
print("tensor", tensor[torch.tensor([166, 896, 66, 888, 1])])
torch.save(tensor, '/test.pt')

# Path to the file containing the tensor
file_path = '/test.pt'

# Specify the indices of rows to select
indices = [166, 896, 66, 888, 1]

# Define the tensor shape
num_rows = len(indices)
row_size = 128 * 768  # Assuming each row has a shape of [128, 768]

# Load specific rows using mmap
selected_rows = []
with open(file_path, 'r+b') as file:
    # Memory map the file
    mmapped_file = mmap.mmap(file.fileno(), length=0, access=mmap.ACCESS_READ)

    # Calculate the byte size of each row
    row_byte_size = row_size * np.dtype(np.float32).itemsize

    # Iterate through indices and read specific rows
    for idx in indices:
        mmapped_file.seek(idx * row_byte_size)
        row_data = np.frombuffer(mmapped_file.read(row_byte_size), dtype=np.float32)
        row_tensor = torch.tensor(row_data, dtype=torch.float32).view(128, 768)
        selected_rows.append(row_tensor)

# Concatenate selected rows into a new tensor
concatenated_tensor = torch.stack(selected_rows, dim=0)

# Confirm the shape of the new tensor
print(concatenated_tensor)  # This should result in [5, 128, 768]
However, the two printed results are not the same. Could I have some further advice on this issue? Thanks again.
I don’t think you can directly use
mmap.mmap on a file written by torch.save, as the saved file is an archive containing metadata as well as the actual tensor data, while your code expects the raw binary data to start at offset 0.
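If you want the raw-offset arithmetic to work, one option is to write only the raw buffer to disk yourself (e.g. via numpy) instead of using torch.save, so the file contains nothing but the tensor data. A sketch, with a placeholder file name and a small example shape:

import numpy as np
import torch

# Dump only the raw float32 data, with no archive header or metadata
tensor = torch.randn(100, 128, 768)
tensor.numpy().tofile('raw.bin')

# Memory-map the raw file; shape and dtype must be supplied by you,
# since the file itself stores no metadata
mapped = np.memmap('raw.bin', dtype=np.float32, mode='r', shape=(100, 128, 768))

# Fancy indexing copies only the selected rows into memory
indices = [1, 55, 67]
selected = torch.from_numpy(np.array(mapped[indices]))  # shape [3, 128, 768]

With this layout, row i really does start at byte offset i * 128 * 768 * 4, so your original seek-based logic would also be valid against raw.bin.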