Create DataLoader from list of NumPy arrays

lorenzo_fabbri · May 16, 2019, 2:15pm

I’m trying to build a simple CNN where the input is a list of NumPy arrays and the target is a list of real numbers (regression problem).

I’m stuck when I try to create the DataLoader.

Suppose Xp_train and yp_train are two Python lists that contain NumPy arrays. Currently I’m using the following code:

tensor_Xp_train = torch.stack([torch.Tensor(el) for el in Xp_train])
tensor_yp_train = torch.stack([torch.Tensor(el) for el in yp_train])
dataset_p_train = TensorDataset(tensor_Xp_train, tensor_yp_train)
loader_p_train = DataLoader(dataset_p_train)

The problem is that it’s not working since Xp_train is a list of 149798 arrays.

Is there a way to create a DataLoader without loading everything into memory?

ptrblck · May 23, 2019, 5:59pm

You could create a Dataset and load the data lazily.
However, if you have already loaded the numpy arrays, they should apparently fit into your RAM.
Try to use torch.from_numpy to reuse the underlying memory and to avoid a copy.
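
As a minimal sketch of what "reuse the underlying memory" means (the small array below is a made-up stand-in for one element of Xp_train, and the comparison with torch.tensor is added for illustration only): a tensor created by torch.from_numpy sees in-place changes to the array, while torch.tensor makes a copy.

```python
import numpy as np
import torch

# Hypothetical stand-in for one element of Xp_train.
arr = np.zeros((2, 3), dtype=np.float32)

shared = torch.from_numpy(arr)  # shares memory with arr, no copy
copied = torch.tensor(arr)      # torch.tensor always copies its input

arr[0, 0] = 42.0  # modify the NumPy array in place

shared_value = shared[0, 0].item()  # sees the change
copied_value = copied[0, 0].item()  # still holds the old value
print(shared_value, copied_value)
```

Note that torch.stack still allocates a new tensor for its result, so stacking the whole list needs enough free memory for one full copy of the data.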

ayrts · August 6, 2020, 2:43am

Sorry for the necrobump, but I am having a similar issue. I have a huge list of numpy arrays (>100,000), and I am trying to create a custom Dataset that imitates ImageFolder with two classes (fake and real).

However, I still run out of memory even before training starts. How can I load the data lazily? Below is my code that implements my custom dataset. I call it with CustomDataset(args.data_root), where args.data_root is the root directory of my dataset, then I pass it to the vanilla DataLoader. Any help is appreciated. Thanks!

class CustomDataset(Dataset):
    def __init__(self, data_root, transform=None):
        fake = join(data_root, 'fake')
        real = join(data_root, 'real')
        data=[np.load(join(fake, array)) for array in os.listdir(fake)]
        data.extend([np.load(join(real, array)) for array in os.listdir(real)])

        target=[np.zeros(1, dtype=np.long) for i in range(len([array for array in os.listdir(fake) if os.path.isfile(array)]))]
        target.extend([np.ones(1, dtype=np.long) for i in range(len([array for array in os.listdir(real) if os.path.isfile(array)]))])
        target=[target[i][0] for i in range(len(target))]
        self.data = torch.from_numpy(data)
        self.target = torch.from_numpy(target)
        self.transform = transform

ptrblck · August 8, 2020, 9:55am

To load data lazily, you would have to move the actual loading into Dataset.__getitem__, while you are apparently preloading the complete dataset in the __init__ method:

data=[np.load(join(fake, array)) for array in os.listdir(fake)]
data.extend([np.load(join(real, array)) for array in os.listdir(real)])

This is easy to do if each sample is stored in its own file (e.g. images), but you might need to implement more logic if, e.g., each numpy array contains multiple samples.

In that case, you could load each numpy array and return it completely. This approach would basically multiply your batch_size (passed to the DataLoader) by the number of samples per loaded array.
Also, the shuffle option would only shuffle the numpy files, not the samples directly.
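
For the simple case of one sample per file, moving the loading into __getitem__ could look roughly like the sketch below. The class name LazyNpyDataset is hypothetical, and it assumes the fake/real folder layout from the question with one .npy file per sample and fake mapped to label 0, real to label 1.

```python
import os
from os.path import join

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class LazyNpyDataset(Dataset):
    """Hypothetical lazy variant: __init__ stores paths only, __getitem__ loads."""

    def __init__(self, data_root, transform=None):
        self.samples = []  # list of (path, label) pairs, no array data
        for label, name in enumerate(['fake', 'real']):  # fake -> 0, real -> 1
            folder = join(data_root, name)
            for fname in sorted(os.listdir(folder)):
                if fname.endswith('.npy'):
                    self.samples.append((join(folder, fname), label))
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        # Only the requested file is read from disk.
        sample = torch.from_numpy(np.load(path))
        if self.transform is not None:
            sample = self.transform(sample)
        return sample, label
```

It would then be passed to the DataLoader as before, e.g. DataLoader(LazyNpyDataset(args.data_root), batch_size=32, shuffle=True), where the batch size of 32 is just a placeholder. If each file held n samples instead, every item would carry an extra leading dimension of size n, so a batch would contain batch_size * n samples.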

ayrts · August 9, 2020, 10:51am

Thanks for the reply Patrick. Each numpy array only contains one sample. It turns out DatasetFolder is exactly what I need because I basically needed a generic implementation of ImageFolder. I implemented it like this:

def npy_loader(path):
    sample = torch.from_numpy(np.load(path))
    return sample

train_set = DatasetFolderWithPaths(root=args.data_root, loader=npy_loader, extensions='.npy', transform=transform)

jdhmr · September 6, 2021, 11:52am

Hi Patrick, sorry for the revival of an old post.

I am using a huge dataset (8 GB) of numpy arrays.

I have tried an lmdb dataset approach, which involved converting the arrays into .jpg format, but I haven’t been able to reproduce my previous results on arrays.

The solutions here are ideal, but would you say it is a bad idea to load .npy arrays every minibatch? It seems people don’t tend to do this!

ptrblck · September 6, 2021, 9:45pm

If I understand your use case correctly, you are loading a .npy file for each sample? If so, I don’t think it’s a bad idea. Do you have any specific concerns?
I think the usual approach of loading image types (e.g. JPEG) had the advantage of letting you apply image transformations on the samples directly, but since torchvision.transforms now also supports transformations on tensors, you could also load the arrays/tensors directly.