How to lazily load tensors from a few .pt files into a DataLoader

I currently have 11 .pt files, each of size torch.Size([1000000, 3, 50, 40]). Each tensor fed to the CNN is 3x50x40, and each .pt file holds 1 million of these tensors. I cannot combine them into a single file due to memory limitations, and I do not want to save them as 11 million individual .pt files. Can anyone help me understand how to get these into a DataLoader?

With a smaller dataset I have used:

from torch.utils.data import TensorDataset, DataLoader, random_split

data_tensor = torch.load('tensor_1.pt')   # target_tensor holds the matching labels
dataset = TensorDataset(data_tensor, target_tensor)
size = len(dataset)
train_set, val_set, test_set = random_split(dataset, [int(size*.8), int(size*.1), size-int(size*.8)-int(size*.1)])
train_loader = DataLoader(train_set, batch_size=128, num_workers=4, shuffle=True)

but with the size of these files this will not work. Thank you!

Each file should take 1000000 * 3 * 50 * 40 * 4 / 1024**3 ~= 22.3 GB of memory (assuming float32). I don’t know if you could load only parts of this tensor into memory, in case loading an entire file already runs into an OOM error.
In numpy you could use mmap_mode in np.load to keep the array on disk and load only the slices you need.
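For example, a quick sketch of the memmap idea (it assumes you re-save each chunk once as an .npy file via np.save; the file names here are just placeholders):

import numpy as np
import torch

# one-time conversion (hypothetical): np.save('tensor_1.npy', torch.load('tensor_1.pt').numpy())
arr = np.load('tensor_1.npy', mmap_mode='r')  # memory-mapped, nothing is read into RAM yet
chunk = np.array(arr[0:128])                  # copies only these 128 samples into memory
batch = torch.from_numpy(chunk)               # torch.Size([128, 3, 50, 40])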

Thank you @ptrblck. A single file of this size loads and runs smoothly. I just can't figure out how to load the 11 of them only when needed. Any guidance is helpful.

BTW, I have learned more from reading your responses on others' posts over the past year than from any other online forum. Thank you for so consistently responding to questions!

A simple approach would be to recreate the dataset as a new TensorDataset for each file, wrap it into a DataLoader, and train with it.
You could also write a custom Dataset that opens a single file at a time in __getitem__ and holds on to it until all of its samples have been used.
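A rough sketch of such a Dataset could look like this (the data/target paths and samples_per_file are placeholders; it assumes each data file has a matching target file):

import torch
from torch.utils.data import Dataset

class LazyFileDataset(Dataset):
    # Placeholder sketch: holds one file in memory at a time and swaps it out
    # only when an index from another file is requested.
    def __init__(self, data_paths, target_paths, samples_per_file=1000000):
        self.data_paths = data_paths
        self.target_paths = target_paths
        self.samples_per_file = samples_per_file
        self.loaded_file = None   # index of the file currently held in memory
        self.data = None
        self.targets = None

    def __len__(self):
        return len(self.data_paths) * self.samples_per_file

    def __getitem__(self, index):
        file_idx, sample_idx = divmod(index, self.samples_per_file)
        if file_idx != self.loaded_file:
            # load a new chunk only once the previous one is no longer needed
            self.data = torch.load(self.data_paths[file_idx])
            self.targets = torch.load(self.target_paths[file_idx])
            self.loaded_file = file_idx
        return self.data[sample_idx], self.targets[sample_idx]

Note that this only avoids constant reloading if the indices arrive roughly in file order (e.g. shuffle=False, or a sampler that shuffles only within each file), and with num_workers > 0 each worker process would hold its own copy of the current chunk.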

By “wrap it into a DataLoader”, do you mean using ConcatDataset, or is there another way to do this? I have never pulled two datasets into one DataLoader.

No, I wouldn’t use a ConcatDataset since you would need to preload all datasets, which wouldn’t fit into your RAM.
I was thinking about this simple approach:

for i in range(11):  # one pass per file
    data_tensor = torch.load('tensor_{}.pt'.format(i))
    # target_tensor: the labels belonging to this file, loaded however you store them
    dataset = torch.utils.data.TensorDataset(data_tensor, target_tensor)
    size = len(dataset)
    train_set, val_set, test_set = random_split(dataset, [int(size*.8), int(size*.1), size-int(size*.8)-int(size*.1)])
    train_loader = DataLoader(train_set, batch_size=128, num_workers=4, shuffle=True)

    # train on this chunk
    for data, target in train_loader:
        ...

Probably not the most elegant approach, as you could hide this logic in a custom Dataset, but it might just do the job.
