I currently have 11 .pt files, each of shape torch.Size([1000000, 3, 50, 40]). Each tensor for the CNN is 3x50x40, and each .pt file holds 1MM of these tensors. I cannot combine them due to memory limitations, and I do not want to save them as 11MM individual .pt files. Can anyone help me understand how to get these into a DataLoader?
With a smaller dataset I have used:
import torch
from torch.utils.data import DataLoader, random_split

data_tensor = torch.load('tensor_1.pt')
dataset = torch.utils.data.TensorDataset(data_tensor, target_tensor)
train_set, val_set, test_set = random_split(dataset, [int(size*.8), int(size*.1), size - int(size*.8) - int(size*.1)])
train_loader = DataLoader(train_set, batch_size=128, num_workers=4, shuffle=True)
but with the size of these files this will not work. Thank you!
Each file should take 1000000 * 3 * 50 * 40 * 4 / 1024**3 ≈ 22.35 GB
of memory. If loading the entire file already runs into an OOM, you could try loading only parts of this tensor into memory.
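As a quick sanity check, that estimate assumes float32 storage (4 bytes per element) and can be reproduced directly:

```python
# Memory footprint of one chunk: 1M samples, each a float32 tensor of shape (3, 50, 40).
bytes_per_file = 1_000_000 * 3 * 50 * 40 * 4  # 4 bytes per float32 element
gib = bytes_per_file / 1024**3
print(round(gib, 2))  # ≈ 22.35
```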
In numpy you could use the mmap_mode argument to keep the file on disk and load only slices.
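A minimal sketch of the mmap approach, assuming one chunk were first exported to .npy (the file name and the small demo shape here are illustrative stand-ins for the real (1000000, 3, 50, 40) chunks):

```python
import numpy as np

# Small stand-in for one of the big chunks (real shape: (1000000, 3, 50, 40)).
demo = np.random.rand(10, 3, 50, 40).astype(np.float32)
np.save('demo_chunk.npy', demo)

# mmap_mode='r' keeps the data on disk; only the slices you index are read.
arr = np.load('demo_chunk.npy', mmap_mode='r')
batch = np.array(arr[0:4])  # copies just these 4 samples into RAM
print(batch.shape)  # (4, 3, 50, 40)
```

The slice can then be handed to PyTorch via torch.from_numpy without ever loading the whole file.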
Thank you @ptrblck. A file of this size loads smoothly on its own. I just can't figure out how to load the 11 of them only when needed. Any guidance is helpful.
BTW, I have learned more from reading your responses on others' posts over the past year than from any online forum. Thank you for so consistently responding to questions!
A simple approach would be to recreate the dataset as a new TensorDataset for each file, wrap it into a DataLoader, and train with it.
You could also write a custom Dataset that opens a single file at a time in __getitem__ and holds on to it until all of its samples have been used.
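A rough sketch of such a lazily-loading Dataset (the class name, the file layout, and the data-only __getitem__ are assumptions for illustration; the real version would also return the matching target):

```python
import torch
from torch.utils.data import Dataset


class ChunkedFileDataset(Dataset):
    """Keeps only one chunk file in memory at a time (hypothetical helper)."""

    def __init__(self, file_paths, samples_per_file):
        self.file_paths = file_paths
        self.samples_per_file = samples_per_file
        self.cached_idx = None    # index of the currently loaded file
        self.cached_data = None   # tensor from that file

    def __len__(self):
        return len(self.file_paths) * self.samples_per_file

    def __getitem__(self, idx):
        # Map the global index to (which file, offset within that file).
        file_idx, offset = divmod(idx, self.samples_per_file)
        if file_idx != self.cached_idx:
            # Swap the needed chunk into memory; the previous one is freed.
            self.cached_data = torch.load(self.file_paths[file_idx])
            self.cached_idx = file_idx
        return self.cached_data[offset]
```

Note that global shuffling would defeat the cache (every sample could come from a different file), so this works best with sequential access or a sampler that shuffles within one chunk at a time.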
By “wrap it into a DataLoader” do you mean by using ConcatDataset or is there another way to do this? I have never pulled two datasets into one DataLoader.
No, I wouldn't use a ConcatDataset, since you would need to preload all datasets, which wouldn't fit into your RAM.
I was thinking about this simple approach:
import torch
from torch.utils.data import DataLoader, random_split

for i in range(11):  # one pass per chunk file
    data_tensor = torch.load('tensor_{}.pt'.format(i))
    size = data_tensor.size(0)
    dataset = torch.utils.data.TensorDataset(data_tensor, target_tensor)
    train_set, val_set, test_set = random_split(
        dataset, [int(size*.8), int(size*.1), size - int(size*.8) - int(size*.1)])
    train_loader = DataLoader(train_set, batch_size=128, num_workers=4, shuffle=True)
    # train
    for data, target in train_loader:
        ...
Probably not the most elegant approach, as you could hide this logic in a custom Dataset, but it might just do its job.