How to load many numpy arrays to DataLoader from disk?

Is there a more efficient way to load all the .npy arrays into a DataLoader? This approach doesn't work when there are too many arrays:

files = glob.glob('./imgs_*.npy')  # match only the image files so len(files) is the number of samples

all_inputs = torch.tensor([])
all_labels = torch.tensor([])

for i in range(len(files)):
    input = torch.tensor(np.load('imgs_' + str(i) + '.npy'))
    all_inputs = torch.cat([all_inputs, input], dim=0)
    print(all_inputs.shape)

    labels = torch.tensor(np.load('labels_' + str(i) + '.npy'))
    all_labels = torch.cat([all_labels, labels], dim=0)
    print(all_labels.shape)


dataset = TensorDataset(all_inputs, all_labels)
dataloader = DataLoader(dataset)

You could write a custom Dataset and lazily load each numpy array in its __getitem__. This tutorial might be a good starting point.
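A minimal sketch of such a lazy Dataset, assuming the files are named `imgs_<i>.npy` / `labels_<i>.npy` as in your snippet (the class name and the `root` argument are just illustrative choices):

```python
import glob
import os

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class LazyNpyDataset(Dataset):
    """Loads one (input, label) pair from disk per __getitem__ call,
    so only the current batch is ever held in memory."""

    def __init__(self, root='.'):
        # Sort both lists the same way so imgs_<i> stays paired with labels_<i>
        self.img_files = sorted(glob.glob(os.path.join(root, 'imgs_*.npy')))
        self.label_files = sorted(glob.glob(os.path.join(root, 'labels_*.npy')))
        assert len(self.img_files) == len(self.label_files)

    def __len__(self):
        return len(self.img_files)

    def __getitem__(self, idx):
        # np.load happens here, lazily, instead of up front for all files
        x = torch.from_numpy(np.load(self.img_files[idx]))
        y = torch.from_numpy(np.load(self.label_files[idx]))
        return x, y


dataset = LazyNpyDataset('.')
dataloader = DataLoader(dataset, batch_size=4, num_workers=2)
```

If each .npy file holds a batch of samples rather than a single one, you would instead index into the loaded array inside __getitem__ (or precompute a file/offset mapping in __init__).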

Is there a good way to apply transforms to numpy arrays that have, say, 10 channels using this method?

You could convert the numpy.array to a tensor and apply transforms on it, as transforms also support tensors:

import numpy as np
import torch
from torchvision import transforms

transform = transforms.Resize((20, 20))
x = np.random.randn(10, 200, 200)
out = transform(torch.from_numpy(x))
print(out.shape)
# torch.Size([10, 20, 20])