Image Folder with many non-class subfolders

Hi,
I’m building an image classifier using trail camera data. I’m trying to create a custom dataset by loading images from a directory with many subfolders organized by capture location and date, not by class.

For example:

/data
    /COL
            /COLE
                        /CO_LE_05_190211
                                     /CLE-COLE05_00001_1-6-2019.jpg
                                     /CLE-COLE05_00001_1-25-2019.jpg
                                     .
                                     .

                        /CO_LE_05_190227
                                     /CLE-COLE05_00002_2-5-2019.jpg
                                     /CLE-COLE05_00002_2-6-2019.jpg
                                     .
                            
                        /CO_LE_05_190404
                                     /CLE-COLE05_00003_4-7-2019.jpg
                                     /CLE-COLE05_00003_4-13-2019.jpg
                                     .   
                                     .

I have a csv file with image paths and their respective labels, so I could take a subset of the dataset and structure subfolders based on each class. However, working with only a subset would limit accuracy, inference, etc.

I have over 3 million images, and re-organizing them by class is not feasible given my memory limitations and the time it would take.

I was wondering if anyone could suggest a custom dataset loader that would allow me to reference the csv with image path and label info, while retaining the current folder structure?

Any help is greatly appreciated.

cheers,
mkutu

I think the approach of using your already valid .csv file and keeping the folder structure sounds reasonable and I would avoid moving around the data.

To use the csv data, I would recommend loading it in your Dataset.__init__ method with e.g. pandas via pd.read_csv and using the index in Dataset.__getitem__(self, index) to get the current file path and label.
I’m not sure how your csv is structured, but you can index the rows and columns of the pd.DataFrame easily.
Once you have the image path, you could load it with PIL and transform it to a tensor.
The label will most likely be returned as a numpy array or scalar, so you could use torch.from_numpy() for an array or torch.tensor() for a scalar to create the target tensor.
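
A minimal sketch of this approach could look like the code below. Since I don't know your csv layout, it assumes two columns named "path" and "label" (hypothetical names), with paths relative to a root folder and class names stored as strings or ints. The class name CsvImageDataset and the fallback tensor conversion are placeholders too, not anything PyTorch ships, so adapt them as needed.

```python
import os

import numpy as np
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset


class CsvImageDataset(Dataset):
    """Hypothetical dataset: reads (path, label) rows from a csv and leaves files in place."""

    def __init__(self, csv_file, root_dir="", path_col="path",
                 label_col="label", transform=None):
        # Column names are assumptions; change them to match the real csv header.
        self.frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.path_col = path_col
        self.label_col = label_col
        self.transform = transform
        # Map each class name to an integer id (sorted for a stable order).
        self.classes = sorted(self.frame[label_col].unique())
        self.class_to_idx = {name: i for i, name in enumerate(self.classes)}

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, index):
        row = self.frame.iloc[index]
        img_path = os.path.join(self.root_dir, row[self.path_col])
        # Only the requested image is read from disk, so memory use stays small.
        with Image.open(img_path) as img:
            image = img.convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        else:
            # Fallback: HWC uint8 array -> CHW float tensor scaled to [0, 1].
            array = np.array(image)
            image = torch.from_numpy(array).permute(2, 0, 1).float() / 255.0
        target = torch.tensor(self.class_to_idx[row[self.label_col]],
                              dtype=torch.long)
        return image, target
```

You could then wrap an instance in torch.utils.data.DataLoader to get batching, shuffling and multiple workers. Note that the default batching stacks the image tensors, so all images in a batch need the same size; passing a transform that resizes or crops them is the usual way to handle that.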