I have hundreds of CSV files, each containing hundreds of megabytes of data. To create a class that inherits from PyTorch's `Dataset`, the `__getitem__` method must access a single sample at a time, where the `i` parameter of the function indicates the index of the sample. However, to perform lazy loading, my class just saves the name of each file instead of loading all the data into memory.
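For context, here is a minimal skeleton of the class so far; the glob pattern and class name are just placeholders:

```python
import glob

from torch.utils.data import Dataset


class LazyCsvDataset(Dataset):
    """Lazy loading: store only the CSV file names, never the full data."""

    def __init__(self, pattern="data/*.csv"):  # hypothetical path pattern
        self.csv_paths = sorted(glob.glob(pattern))
```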
So far so good, but I'm having doubts about how to access a single sample in the `__getitem__` method. In my searches I found the following strategies:

1. In `__init__`, save the number of samples each file has; then in `__getitem__`, load the corresponding file with pandas and access the respective row. The index received as a parameter tells us which file to access. For example, with 2 files of 100 samples each, index 100 would be the first sample of the second file, index 101 the second sample of the second file, and so on. However, this approach has an obvious problem: the number of I/O operations needed.
2. Perform the same process as in 1, but keep loaded files in a cache as long as they fit in memory. This seems better than the first approach, since it reduces the number of I/O operations once some files are in memory. The problem here is how to carry out the caching policy. I thought of using something like `dask` to handle this, but `iloc` is quite inefficient according to its documentation. (A sketch of both strategies combined follows below.)
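To make the question concrete, here is a rough sketch that fills in the skeleton above, implementing strategy 1 (a global-index-to-file mapping) with strategy 2 layered on top via `functools.lru_cache`; the LRU policy and `maxsize=8` are just one possibility I'm considering, and the row counting assumes each CSV has a single header line:

```python
import bisect
import glob
from functools import lru_cache

import pandas as pd
from torch.utils.data import Dataset


class LazyCsvDataset(Dataset):
    def __init__(self, pattern="data/*.csv"):  # hypothetical path pattern
        self.csv_paths = sorted(glob.glob(pattern))
        # Strategy 1: count samples per file once, up front
        # (assumes each file has exactly one header line).
        lengths = []
        for p in self.csv_paths:
            with open(p) as f:
                lengths.append(sum(1 for _ in f) - 1)
        # Cumulative counts map a global index to a (file, row) pair.
        self.cum_lengths = []
        total = 0
        for n in lengths:
            total += n
            self.cum_lengths.append(total)

    def __len__(self):
        return self.cum_lengths[-1]

    # Strategy 2: keep the most recently used files in memory;
    # maxsize bounds how many whole DataFrames are cached at once.
    @lru_cache(maxsize=8)
    def _load_file(self, file_idx):
        return pd.read_csv(self.csv_paths[file_idx])

    def __getitem__(self, i):
        # Binary search for the file containing global index i,
        # e.g. with two 100-row files, i=100 -> second file, row 0.
        file_idx = bisect.bisect_right(self.cum_lengths, i)
        row_idx = i if file_idx == 0 else i - self.cum_lengths[file_idx - 1]
        return self._load_file(file_idx).iloc[row_idx].to_numpy()
```

With this, the caching-policy question from strategy 2 reduces to picking `maxsize` (or replacing LRU with something smarter), which is exactly the part I'm unsure about.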
Does anyone have a better idea, or a pointer to some relevant material?