Dataset with many large .npz files but limited memory

I have many .npz datasets (images and labels). Each one is very large, so I can't load all of them into a dataloader (I only have 32 GB of memory).

There are multiple npz files, and each has a different length (at most 5000 samples),

so what is the best way to implement dynamic loading of the data?

The only approach I've come up with is the naive one: store a list of .npz file names, iterate through it, and build the dataloader on each iteration (a more concrete sketch follows the pseudocode below),
e.g.
file_list = [file1, file2, file3, file4, …]
for file in file_list:
    dataset = …
    dataloader = …
    # training …
    del dataset, dataloader
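
For concreteness, here is a minimal sketch of what the per-file dataset in that loop could look like, assuming each .npz stores two arrays under the keys "images" and "labels" (the class name NpzFileDataset and those key names are placeholders, not something from my actual code):

import numpy as np
import torch
from torch.utils.data import Dataset

class NpzFileDataset(Dataset):
    """Loads one .npz file fully into memory and serves its samples."""
    def __init__(self, path):
        data = np.load(path)
        self.images = data["images"]   # assumed shape (N, C, H, W)
        self.labels = data["labels"]   # assumed shape (N, H, W) segmentation masks

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = torch.from_numpy(self.images[idx]).float()
        label = torch.from_numpy(self.labels[idx]).long()
        return image, label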

But that leads to another problem: each npz file is so big (roughly 1~3 GB) that loading it takes a huge amount of time on every iteration.

Is there a better solution for this scenario?

So each image of your data is already very large?

Many labels are images too (segmentation)

so each of the .npz files is very large

Can you store your segmentation labels at a different scale? Or pre-emptively resize all your data to a smaller resolution?
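
If you go down that route, a rough one-off preprocessing sketch could look like the following. The file paths, the "images"/"labels" keys, and the target size are all assumptions about the data layout; nearest-neighbour interpolation is used for the labels so class ids are not blended:

import numpy as np
import torch
import torch.nn.functional as F

def shrink_npz(src_path, dst_path, size=(256, 256)):
    # Load one .npz, downsample images and masks, and re-save it compressed.
    data = np.load(src_path)
    images = torch.from_numpy(data["images"]).float()   # assumed (N, C, H, W)
    labels = torch.from_numpy(data["labels"]).float()   # assumed (N, H, W)

    images = F.interpolate(images, size=size, mode="bilinear", align_corners=False)
    labels = F.interpolate(labels.unsqueeze(1), size=size, mode="nearest").squeeze(1)

    np.savez_compressed(dst_path,
                        images=images.numpy(),
                        labels=labels.numpy().astype(np.int64))

shrink_npz("file1.npz", "file1_small.npz")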


Thanks, I'll try that, but I'm still figuring out the best solution :'(

I'm currently just iterating through the npz files and building the dataloader dynamically while training.
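
One way to keep that idea but avoid rebuilding the DataLoader for every file is to wrap the whole file list in a single IterableDataset that streams one .npz at a time. A rough sketch follows (again assuming "images"/"labels" keys; the class name is hypothetical). Note that with num_workers > 0 every worker would replay all files unless the path list is sharded per worker, so the sketch keeps num_workers=0:

import numpy as np
import torch
from torch.utils.data import IterableDataset, DataLoader

class NpzStreamDataset(IterableDataset):
    """Streams samples file by file, holding only one .npz in memory at a time."""
    def __init__(self, paths):
        self.paths = paths

    def __iter__(self):
        for path in self.paths:
            data = np.load(path)                 # load one file at a time
            images, labels = data["images"], data["labels"]
            for i in range(len(images)):
                yield (torch.from_numpy(images[i]).float(),
                       torch.from_numpy(labels[i]).long())
            del data, images, labels             # free this file before the next one

loader = DataLoader(NpzStreamDataset(["file1.npz", "file2.npz"]),
                    batch_size=8, num_workers=0)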