Image Folder with many non-class subfolders

mkutu · March 10, 2020, 7:28pm

Hi,
I’m building an image classifier using trail camera data. I’m trying to create a custom dataset by loading images from a directory with many subfolders organized by capture location and date rather than by class.

For example:

/data
    /COL
            /COLE
                        /CO_LE_05_190211
                                     /CLE-COLE05_00001_1-6-2019.jpg
                                     /CLE-COLE05_00001_1-25-2019.jpg
                                     .
                                     .

                        /CO_LE_05_190227
                                     /CLE-COLE05_00002_2-5-2019.jpg
                                     /CLE-COLE05_00002_2-6-2019.jpg
                                     .
                            
                        /CO_LE_05_190404
                                     /CLE-COLE05_00003_4-7-2019.jpg
                                     /CLE-COLE05_00003_4-13-2019.jpg
                                     .   
                                     .

I have a CSV file with image paths and their respective labels, so I could take a subset of the dataset and structure subfolders by class. However, doing so would limit accuracy, inference, etc.

I have over 3 million images and have found that reorganizing them by class would not be possible given my memory limitations and the time needed to do so.

I was wondering if anyone could suggest a custom dataset loader that would allow me to reference the csv with image path and label info, while retaining the current folder structure?

any help is greatly appreciated.

cheers,
mkutu

ptrblck · March 11, 2020, 2:58am

I think the approach of using your already valid .csv file and keeping the folder structure sounds reasonable and I would avoid moving around the data.

To use the CSV data, I would recommend loading it in your Dataset.__init__ method, e.g. with pandas via pd.read_csv, and using the index in Dataset.__getitem__(self, index) to get the current file path and label.
I’m not sure how your csv is structured, but you can index the rows and columns of the pd.DataFrame easily.
Once you have the image path, you could load it with PIL and transform it to a tensor.
The label will most likely be returned as a numpy array/scalar, so you could use torch.from_numpy() to create the target tensor.
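
A minimal sketch of the approach described above, not a definitive implementation. It assumes the CSV has a `path` column (relative to a root folder) and a `label` column holding integer class indices; those column names and the `CsvImageDataset` class name are hypothetical and would need to match the real file.

```python
from pathlib import Path

import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset


class CsvImageDataset(Dataset):
    """Reads (image path, label) pairs from a CSV; the folders stay untouched."""

    def __init__(self, csv_file, root_dir, transform=None):
        # Assumed CSV layout: a "path" column (relative to root_dir)
        # and a "label" column holding integer class indices.
        self.frame = pd.read_csv(csv_file)
        self.root_dir = Path(root_dir)
        self.transform = transform

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, index):
        row = self.frame.iloc[index]
        # Images are opened lazily, one per call, never all at once
        image = Image.open(self.root_dir / row["path"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        label = torch.tensor(int(row["label"]), dtype=torch.long)
        return image, label
```

torch.tensor is used for the label here because a single DataFrame cell comes back as a scalar rather than an array; torch.from_numpy would fit if the label were a NumPy array. Passing a transform such as torchvision's ToTensor would be needed before batching with a DataLoader.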

Omnia_Al-wazzan · February 16, 2023, 10:00pm

Hi @ptrblck, I have a similar problem to @mkutu’s.

I have a breast cancer image dataset which is arranged as follows:

Data
|---- ID_1 / 2016 / 1-N / 1-1.dcm
|---- ID_1 / 2016 / 1-N / 1-2.dcm
|
|---- ID_2 / 2016 / 1-N / 1-1.dcm
|---- ID_2 / 2016 / 1-N / 1-2.dcm
|
|---- ID_3 / 2016 / 1-N / 1-1.dcm
|---- ID_3 / 2016 / 1-N / 1-2.dcm
|---- ID_3 / 2016 / 1-N / 1-3.dcm
|---- ID_3 / 2016 / 1-N / 1-4.dcm

And I have a CSV file that has the patient ID which is the same name as the first folder in the above format, for example:

I would like to read each image along with its label from the CSV. Is that possible? I am confused about how to use the os.walk() function inside my dataloader.

I would greatly appreciate it if you could point me in the right direction

Thanks

ptrblck · February 17, 2023, 3:02am

You wouldn’t implement any data loading logic in the DataLoader but inside your custom Dataset.
The main idea would be the same as previously described:
In the Dataset.__init__ method you would store the paths to each sample by processing the CSV file.
To do so you could read the CSV file via pandas and create the paths using the root folder etc.
Once this is done you would then store all paths in e.g. a list and index it in the __getitem__ to load each sample.
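
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions: the CSV is assumed to have one row per patient with an `ID` column matching the first-level folder names and an integer `label` column, and the class name `PatientFolderDataset`, the `loader` argument, and the default glob pattern are all hypothetical.

```python
from pathlib import Path

import pandas as pd
from torch.utils.data import Dataset


class PatientFolderDataset(Dataset):
    """Pairs every file under <root>/<patient ID>/ with that patient's CSV label."""

    def __init__(self, csv_file, root_dir, loader, pattern="*/*/*.dcm"):
        # Assumed CSV layout: an "ID" column matching the first-level
        # folder names and a "label" column with integer class indices.
        frame = pd.read_csv(csv_file)
        root_dir = Path(root_dir)
        self.loader = loader  # callable that turns a path into an image
        self.samples = []
        for patient_id, label in zip(frame["ID"], frame["label"]):
            # sorted() keeps the sample order reproducible across runs
            for path in sorted((root_dir / str(patient_id)).glob(pattern)):
                self.samples.append((path, int(label)))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        return self.loader(path), label
```

All paths are collected once in __init__, so os.walk() is not needed and __getitem__ only has to index the list. The `loader` can be any function that reads one file, for example a pydicom-based reader.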

Omnia_Al-wazzan · February 17, 2023, 5:16pm

As usual, your feedback is always constructive, thanks a lot.

Here I followed your suggestion:


import numpy as np
import torch
from torchvision import transforms
import random
from pathlib import Path
#from typing import Tuple, Dict, List
import pydicom
import glob
import sys
import os
from torch.utils.data import Dataset,DataLoader
from pydicom.pixel_data_handlers.util import apply_voi_lut
import matplotlib.pyplot as plt
NUM_WORKERS=0
PIN_MEMORY=True    



transform_1 =  transforms.Compose([transforms.ToTensor()])

def read_xray(path, voi_lut = True, fix_monochrome = True):
    dicom = pydicom.read_file(path)
    
    # VOI LUT (if available by DICOM device) is used to transform raw DICOM data to "human-friendly" view
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
               
    # depending on this value, MRI may look inverted - fix that:
    if fix_monochrome and dicom.PhotometricInterpretation == "MONOCHROME2":
        data = np.amax(data) - data
        
    data = data - np.min(data)
    data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
        
    return data



class Dataset_(Dataset):
    def __init__(self, image_dir,transform=None):
        self.image_dir = image_dir
       # self.genes=path_genes_data
        self.images = list(image_dir.glob("*/*/*/*.dcm"))
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        ## reading image ###
        img_path = os.path.join(self.image_dir, self.images[index])
        image = read_xray(img_path)
        #plt.figure(figsize = (12,12))
        #plt.imshow(image, 'gray')
        image_name=self.images[index]
        if self.transform is not None:
            a = self.transform(image=image)
            image = a['image']
            #image=np.transpose(image, (2, 0, 1))
            
        return image,self.images[index]



 
def Data_Loader( test_dir,batch_size,num_workers=NUM_WORKERS,pin_memory=PIN_MEMORY):
    
    load_data = Dataset_( image_dir=test_dir,transform=transform_1)

    data_loader = DataLoader(load_data,batch_size=batch_size,num_workers=num_workers,pin_memory=pin_memory,shuffle=True)
    
    return data_loader


data_path = Path(r"E:\IAAA_CMMD\manifest-1616439774456")
images_folder = data_path / "CMMD"
loader=Data_Loader(images_folder,4)
a=iter(loader)
a1=next(a)

I can plot the images and see them, which suggests the data is loading. However, I get this error when trying to verify with a1 = next(a):

runfile('E:/IAAA_CMMD/untitled7.py', wdir='E:/IAAA_CMMD')
Traceback (most recent call last):

  File E:\IAAA_CMMD\untitled7.py:83 in <module>
    a1=next(a)

  File ~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py:530 in __next__
    data = self._next_data()

  File ~\anaconda3\lib\site-packages\torch\utils\data\dataloader.py:570 in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration

  File ~\anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py:49 in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]

  File ~\anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py:49 in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]

  File E:\IAAA_CMMD\untitled7.py:61 in __getitem__
    a = self.transform(image=image)

TypeError: __call__() got an unexpected keyword argument 'image'

Did I do something wrong in the transformation? I can’t tell where the error is.

ptrblck · February 17, 2023, 6:39pm

ToTensor does not expect an image argument:

import torch
from torchvision import transforms

transform = transforms.ToTensor()
img = transforms.ToPILImage()(torch.randn(3, 224, 224))

out = transform(img) # works
out = transform(image=img)
# TypeError: __call__() got an unexpected keyword argument 'image'

Omnia_Al-wazzan · February 18, 2023, 11:10pm

Thanks a lot for your time and effort!

Great to know that.

It’s working now.