Get file names and file path using PyTorch dataloader

grid_world · June 24, 2021, 8:55am

I am using PyTorch 1.8 and Python 3.8 to read images from a folder using the following code:

import torch
from torchvision import datasets, transforms

print(f"PyTorch version: {torch.__version__}")
# PyTorch version: 1.8.1

# Device configuration-
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"currently available device: {device}")
# currently available device: cpu


# Define transformations for training and test sets-
transform_train = transforms.Compose(
    [
      # transforms.RandomCrop(32, padding = 4),
      # transforms.RandomHorizontalFlip(),
      transforms.ToTensor(),
      # transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
     ]
     )

transform_test = transforms.Compose(
    [
      transforms.ToTensor(),
      # transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
     ]
     )

# Define directory containing images-
data_dir = 'My_Datasets/Cat_Dog_data/'

# Define datasets-
# (data_dir already ends with '/'; the transforms are the ones defined above)
train_data = datasets.ImageFolder(data_dir + 'train',
                                  transform = transform_train)
test_data = datasets.ImageFolder(data_dir + 'test',
                                 transform = transform_test)

print(f"number of train images = {len(train_data)} & number of validation images = {len(test_data)}")
# number of train images = 22500 & number of validation images = 2500

print(f"number of training classes = {len(train_data.classes)} & number of validation classes = {len(test_data.classes)}")
# number of training classes = 2 & number of validation classes = 2

# Define data loaders-
trainloader = torch.utils.data.DataLoader(train_data, batch_size = 32)
testloader = torch.utils.data.DataLoader(test_data, batch_size = 32)

len(trainloader), len(testloader)
# (704, 79)

# Sanity check- (the loader lengths above are these values rounded up)
len(train_data) / 32, len(test_data) / 32
# (703.125, 78.125)

You can iterate through the train data using ‘trainloader’ as follows:

for img, lab in trainloader:
   print(img.shape, lab.shape)

However, I am interested in getting the file name along with the file path from which the file was read. How can I achieve this?

Thanks!

Guys, I am sorry for using the CIFAR-10 dataset earlier when in fact I want to read images from local folders. The modified code is above.

ptrblck · June 24, 2021, 9:42am

The CIFAR10 dataset doesn’t download each image as a separate file; it downloads the binary data in batches, so you won’t be able to return paths to each image.
However, in other datasets, which lazily load each image file, you can just return the path from __getitem__ along with the data and target tensors.

rahulvigneswaran · June 24, 2021, 10:21am

I think we can achieve it by doing the following, @grid_world @ptrblck

import os
import pickle

import torchvision
from PIL import Image

class your_new_custom_dataset_class(torchvision.datasets.CIFAR10):  #<----Important
    def __init__(self, *args, **kwargs):
        # Forward root, train, transform, download, ... to CIFAR10
        super().__init__(*args, **kwargs)
        self.get_filenames()    #<----Important

    def __getitem__(self, index):
        # Current torchvision keeps the labels in self.targets
        img, label = self.data[index], self.targets[index]
        filename = self.filenames[index]       #<----Important

        img = Image.fromarray(img)

        if self.transform is not None:
            img = self.transform(img)

        if self.target_transform is not None:
            label = self.target_transform(label)

        return img, label, filename      #<----Important

    def get_filenames(self):    #<----Important
        # Collect the image names stored inside each pickled batch file
        self.filenames = []
        if self.train:
            downloaded_list = self.train_list
        else:
            downloaded_list = self.test_list

        for file_name, checksum in downloaded_list:
            file_path = os.path.join(self.root, self.base_folder, file_name)
            with open(file_path, 'rb') as f:
                entry = pickle.load(f, encoding='latin1')
            self.filenames.extend(entry["filenames"])

PS. The implementation is a bit sloppy, just make sure to note all the lines with the #<----Important comment.

grid_world · June 24, 2021, 11:18am

Apologies for using the CIFAR10 dataset. I am interested in getting the file name and path of each image read from a local folder. The modified code is above.

grid_world · June 24, 2021, 11:18am

Apologies again for using the CIFAR10 dataset. What I actually want is the file name and path of each image when reading from a local folder. The modified code is above.

Ruy_Diaz · May 16, 2022, 4:08pm

In a similar case I found this useful:

dataloader_.sampler.data_source.dataset.imgs

It gave me a list of tuples: (path_and_filename, class)

venki_lfc · February 17, 2023, 9:45am

Exactly what I needed. Thank you @Ruy_Diaz

DJF · August 10, 2023, 7:30pm

Be careful if you use dataloader_.sampler.data_source.dataset.imgs on a dataloader built from a subset of the data. The underlying dataset is unaffected by the subsetting, so this gives the full list of files, not just those in the subset.