How to use DataLoader with a large dataset?

Hi,

I have a 300 GB h5 file that contains images and texts,

and I’m having trouble with the DataLoader. For example, in the code below,

class loader(Dataset):
    def __init__(self):
        # this loads the entire 300 GB file into memory at once
        self.file = load(my_300G_file)

    def __getitem__(self, index):
        return self.file[index]

will I get a memory error because I load the whole 300 GB file? (My RAM is 8 GB.)
If so, how can I load the dataset?

If you try to load the whole file at once, your system will run out of memory.
You could use pandas to load slices:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 1))
df.to_hdf('file.h5', 'df', format='table')  # store as a queryable 'table'
pd.read_hdf('file.h5', 'df', start=0, stop=1)  # read back only the first row

You could use the index to set the start and stop values accordingly.
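
For a single sample that would look something like this (still using the dummy 'file.h5'/'df' from above):

import pandas as pd

index = 5  # e.g. the index passed to __getitem__
# read only this one row from disk, not the entire file
sample = pd.read_hdf('file.h5', 'df', start=index, stop=index + 1)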

@ptrblck

Thanks, so the code I wrote will not work because of memory, right?
Then should I use the pandas load method when I define the dataset class,

class loader(Dataset):
    def __init__(self):
        df = pd.DataFrame(np.random.rand(100, 1))
        df.to_hdf('file.h5', 'df', format='table')
        pd.read_hdf('file.h5', start=0, stop=1)

like this?

No, I just created a dummy dataframe and saved it to disk so that I could load it with pd.read_hdf.
You should add the read line to your __getitem__ method and store the file path in __init__.
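
Putting it together, a rough sketch could look like this (the class name, the 'df' key, and the row-count lookup are just illustrative assumptions based on the dummy file above, not your actual data):

import pandas as pd
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    def __init__(self, file_path, key='df'):
        # store only the path and key, not the data itself
        self.file_path = file_path
        self.key = key
        # read just the number of rows once, without loading the data
        with pd.HDFStore(file_path, mode='r') as store:
            self.nrows = store.get_storer(key).nrows

    def __getitem__(self, index):
        # lazily read a single row; only this slice is held in memory
        row = pd.read_hdf(self.file_path, self.key,
                          start=index, stop=index + 1)
        return row.values.squeeze(0)

    def __len__(self):
        return self.nrows

You can then wrap it in a DataLoader as usual, e.g. DataLoader(H5Dataset('file.h5'), batch_size=32).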

I see,
so if I just store the file path in __init__ and read an indexed slice of the h5 file in __getitem__, like [0:10],
then the computer will only allocate memory for the [0:10] slice, not the whole h5 file.
Am I right?

Yes, that’s the plan. You would need to use the method I’ve posted as you can’t directly index the h5 file.
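
For example, the equivalent of a [0:10] slice with the dummy file from above would be:

import pandas as pd

# only rows 0-9 are read into memory, not the whole file
chunk = pd.read_hdf('file.h5', 'df', start=0, stop=10)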


Hi, I have a similar scenario. I analyse genomic data, and each sequence can be represented as a vector of about 50-1000 dimensions (the larger the better). In short:
I generate features using a different pipeline, which gives me a text file with one vector per line.

In this case, do I have to save the vectors to an h5 file somehow? For example, would something like the sketch below be the right approach? Any help would be greatly appreciated.
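
This is what I have in mind, assuming all vectors have the same dimensionality (file names and the 'vectors' key are just placeholders):

import pandas as pd

# one-off conversion: text file with one whitespace-separated vector per line
df = pd.read_csv('features.txt', sep=r'\s+', header=None)
df.to_hdf('features.h5', 'vectors', format='table')

# later, a single vector can be read without loading the whole file
vec = pd.read_hdf('features.h5', 'vectors', start=0, stop=1)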