Metadata in dataloader (such as file names)

JimS · June 21, 2023, 9:10pm

Is there a way to save the file name of each file in the train and test datasets in the data structure the DataLoader creates?

For example, if I retrieve a particular piece of data from the DataLoader, can I get the name of the file it was created from?

I am doing image analysis and would like to go back to the original image file to (1) check any manipulation done on the image at load time, such as normalization, and (2) compare predictions with the metadata available for the original image.

train_data = datasets.ImageFolder(DEST_PATH + 'train/', transform=transform)
train_data_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, num_workers=2)
for i, (images, lbls) in enumerate(train_data_loader, 0):

At this point, is there a way to get the file names that produced the batch "(images, lbls)"?

Thanks!

ptrblck · June 22, 2023, 6:46am

The DataLoader doesn’t have any knowledge about the dataset besides what is returned by Dataset.__getitem__. You could thus return the image names from the __getitem__ method in addition to the data and target, and then use them in the DataLoader loop.

JimS · June 22, 2023, 4:51pm

Thanks for the response!
So, I would make a local/custom version of DataLoader with the changes in the __getitem__ method?

ptrblck · June 22, 2023, 5:12pm

You would create a custom Dataset (not DataLoader) which would then return the image names in its __getitem__.
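
A minimal sketch of that suggestion, under stated assumptions: torchvision's ImageFolder keeps a `samples` list of (path, class_index) pairs, so a thin wrapper Dataset can return the path as a third item. `DatasetWithPaths` and the `FakeFolder` stand-in (used only so the example runs without image files on disk) are hypothetical names, not part of PyTorch.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class DatasetWithPaths(Dataset):
    """Hypothetical wrapper: returns (image, label, path) instead of (image, label).

    Assumes `base` exposes `.samples`, a list of (path, class_index) pairs,
    as torchvision's ImageFolder does.
    """

    def __init__(self, base):
        self.base = base

    def __len__(self):
        return len(self.base)

    def __getitem__(self, index):
        image, label = self.base[index]      # transforms are applied by the base dataset
        path, _ = self.base.samples[index]   # same index, so the path matches the image
        return image, label, path


class FakeFolder(Dataset):
    """Stand-in for ImageFolder so the sketch runs without files on disk."""

    def __init__(self):
        self.samples = [("cat/a.png", 0), ("cat/b.png", 0), ("dog/c.png", 1)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        _, label = self.samples[index]
        return torch.zeros(3, 8, 8), label   # dummy image tensor


loader = DataLoader(DatasetWithPaths(FakeFolder()), batch_size=2, shuffle=False)
images, labels, paths = next(iter(loader))
# The default collate function stacks the tensors and keeps the strings as a sequence.
print(images.shape, labels, list(paths))
```

With a real dataset, `DatasetWithPaths(datasets.ImageFolder(...))` would be passed to the DataLoader in the same way. Shuffling should not break the pairing, because the path is looked up with the same index as the image.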

Rodrigo_Santos · February 22, 2024, 4:36pm

Hi Coders, I just created a fully functional snippet for this example. I created an instance of the FashionMNIST dataset and set a new __getitem__ on that instance. It works fine when I call the method on the instance directly. However, whenever I use the DataLoader, it does not seem to fetch the metadata. Would you ( @ptrblck ) please help me?

from types import MethodType
from typing import Any, Tuple
from PIL import Image
from torchvision.transforms import ToTensor
from torch.utils.data import DataLoader
from torchvision import datasets
import torch
import copy


################# replace methods... ##########################

def __custom_getitem__(self, index: int) -> Tuple[Any, Any, Any]:
        """
        Args:
            index (int): Index

        Returns:
            tuple: (image, target, metadata).
        """
        img, target = self.data[index], int(self.targets[index])
        metadata = self.metadatas[index]

        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img = Image.fromarray(img.numpy(), mode="L")
        
        if self.transform is not None:
            img = self.transform(img)
            

        if self.target_transform is not None:
            target = self.target_transform(target)
        
        
        return img, target, metadata


####################################################
################# Testing ##########################
####################################################
# loading dataset
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor())

################# Replacing __getitem__ on the instance ##########################
BATCH_SIZE=32
training_data2=copy.deepcopy(training_data)
training_data2.metadatas=torch.zeros(len(training_data))
training_data2.__getitem__= MethodType(__custom_getitem__, training_data2)
training_loader2 = DataLoader(dataset=training_data2, 
                          batch_size=BATCH_SIZE, 
                          shuffle=True,)

# Works fine
print("Using new __getitem__ directly on my instance: length=",
      len(training_data2.__getitem__(1)))

print('\n')

# metadata is not returned in the dataloader

for data in training_loader2:
    print("Iteration returned by dataloader: length=",len(data))
    print("Image shape: ",data[0].shape)
    print("labels shape: ",data[1].shape)
    print("metadata shape: ",data[2].shape)  # IndexError here: the batch has only 2 elements
    break
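
A likely explanation for the result above, sketched in plain Python: the indexing syntax `obj[i]` looks up `__getitem__` on the object's class, not on the instance, and as far as I know the DataLoader fetches samples by indexing the dataset. Calling `training_data2.__getitem__(1)` explicitly does find the instance attribute, which is why the direct call returned three items. `Base`, `custom_getitem`, and `WithMetadata` below are hypothetical stand-ins for the dataset and its methods.

```python
from types import MethodType


class Base:
    def __getitem__(self, index):
        return ("img", "target")


def custom_getitem(self, index):
    return ("img", "target", "metadata")


obj = Base()
obj.__getitem__ = MethodType(custom_getitem, obj)  # patch the instance only

explicit_len = len(obj.__getitem__(0))  # explicit attribute access finds the patch
indexed_len = len(obj[0])               # obj[0] uses type(obj).__getitem__
print(explicit_len, indexed_len)


# Subclassing puts the new method on the class, so indexing picks it up.
class WithMetadata(Base):
    __getitem__ = custom_getitem


subclass_len = len(WithMetadata()[0])
print(subclass_len)
```

This is consistent with the advice earlier in the thread: defining `__getitem__` on a Dataset class (or subclass) rather than patching one instance.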

Rodrigo_Santos · February 22, 2024, 5:21pm

With a new class it works fine… never mind my request for help, heheh…

from PIL import Image
from torchvision.transforms import ToTensor
from torchvision import datasets
import torch
from torch.utils.data import Dataset, DataLoader

############################# Writing a new class ####################################

class dataset_with_metadata(Dataset):
       
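    def __init__(self, data, targets, metadatas, transform=None, target_transform=None):
        """
        Arguments:
            data: FashionMNIST images
            targets: class labels
            metadatas: per-sample metadata, indexed like data
            transform (callable, optional): Optional transform to be applied
                on a sample.
            target_transform (callable, optional): Optional transform to be
                applied on a target.
        """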
        self.data = data
        self.targets = targets
        self.metadatas = metadatas
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self,  index: int):
        
        img, target = self.data[index], int(self.targets[index])
        metadata = self.metadatas[index]

        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img = Image.fromarray(img.numpy(), mode="L")
        
        if self.transform is not None:
            img = self.transform(img)
            

        if self.target_transform is not None:
            target = self.target_transform(target)
        
        
        return img, target, metadata

################# loading dataset ##########################    
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True)

################# Creating Dataset instance ##########################

teste=torch.zeros(len(training_data))
# use the class defined above (the original name here did not match it)
training_data2=dataset_with_metadata(data=training_data.data,
                                     targets=training_data.targets,
                                     metadatas=teste,
                                     transform=ToTensor()
                                     )

################# Creating DataLoader instance ##########################

BATCH_SIZE=32
training_loader2 = DataLoader(dataset=training_data2, 
                          batch_size=BATCH_SIZE, 
                          shuffle=True,)

# Works fine
print("Using new __getitem__ directly on my instance: length=",
      len(training_data2.__getitem__(1)))

print('\n')

# now works fine too

for data in training_loader2:
    print("Iteration returned by dataloader: length=",len(data))
    print("Image shape: ",data[0].shape)
    print("labels shape: ",data[1].shape)
    print("metadata shape: ",data[2].shape)
    break