Dataset with many large .npz files but limited memory

I have many .npz datasets (images and labels). Each one is very large, so I can't load all of them into a dataloader (I only have 32 GB of memory).

There are multiple npz files, and each has a different length (at most 5000 samples),

so what is the best way to implement dynamic loading of the data?

The only approach I've come up with is the naive one: store a list of .npz file names, iterate through it, and build the dataloader on each iteration (a more concrete sketch follows the pseudocode below),
e.g.
file_list = [file1, file2, file3, file4, …]
for file in file_list:
    dataset = …
    dataloader = …
    # training …
    del dataset, dataloader
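
For concreteness, here is a minimal sketch of what the per-file dataset in that loop could look like, assuming each .npz stores two arrays under the keys "images" and "labels" (the class name NpzFileDataset and those key names are placeholders, not something from my actual code):

import numpy as np
import torch
from torch.utils.data import Dataset

class NpzFileDataset(Dataset):
    """Loads one .npz file fully into memory and serves its samples."""
    def __init__(self, path):
        data = np.load(path)
        self.images = data["images"]   # assumed shape (N, C, H, W)
        self.labels = data["labels"]   # assumed shape (N, H, W) segmentation masks

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = torch.from_numpy(self.images[idx]).float()
        label = torch.from_numpy(self.labels[idx]).long()
        return image, label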

But that leads to another problem: each npz file is so big (roughly 1~3 GB) that loading it takes a huge amount of time on every iteration.

Is there a better solution for this scenario?

So each image of your data is already very large?

Many labels are images too (segmentation)

so each of the .npz files is very large

Can you store your segmentation labels at a different scale? Or pre-emptively resize all your data to a smaller resolution?
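
If you go down that route, a rough one-off preprocessing sketch could look like the following. The file paths, the "images"/"labels" keys, and the target size are all assumptions about the data layout; nearest-neighbour interpolation is used for the labels so class ids are not blended:

import numpy as np
import torch
import torch.nn.functional as F

def shrink_npz(src_path, dst_path, size=(256, 256)):
    # Load one .npz, downsample images and masks, and re-save it compressed.
    data = np.load(src_path)
    images = torch.from_numpy(data["images"]).float()   # assumed (N, C, H, W)
    labels = torch.from_numpy(data["labels"]).float()   # assumed (N, H, W)

    images = F.interpolate(images, size=size, mode="bilinear", align_corners=False)
    labels = F.interpolate(labels.unsqueeze(1), size=size, mode="nearest").squeeze(1)

    np.savez_compressed(dst_path,
                        images=images.numpy(),
                        labels=labels.numpy().astype(np.int64))

shrink_npz("file1.npz", "file1_small.npz")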


Thanks, I'll try that, but I'm still figuring out the best solution :'(

I'm currently just iterating through the npz files and building the dataloader dynamically while training.
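
One way to keep that idea but avoid rebuilding the DataLoader for every file is to wrap the whole file list in a single IterableDataset that streams one .npz at a time. A rough sketch follows (again assuming "images"/"labels" keys; the class name is hypothetical). Note that with num_workers > 0 every worker would replay all files unless the path list is sharded per worker, so the sketch keeps num_workers=0:

import numpy as np
import torch
from torch.utils.data import IterableDataset, DataLoader

class NpzStreamDataset(IterableDataset):
    """Streams samples file by file, holding only one .npz in memory at a time."""
    def __init__(self, paths):
        self.paths = paths

    def __iter__(self):
        for path in self.paths:
            data = np.load(path)                 # load one file at a time
            images, labels = data["images"], data["labels"]
            for i in range(len(images)):
                yield (torch.from_numpy(images[i]).float(),
                       torch.from_numpy(labels[i]).long())
            del data, images, labels             # free this file before the next one

loader = DataLoader(NpzStreamDataset(["file1.npz", "file2.npz"]),
                    batch_size=8, num_workers=0)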