Building a custom dataset: how to return IDs as well?

So the format of a custom dataset should be like the following:

import torch
from torch.utils import data

class Dataset(data.Dataset):
    'Characterizes a dataset for PyTorch'
    def __init__(self, list_IDs, labels):
        'Initialization'
        self.labels = labels
        self.list_IDs = list_IDs

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.list_IDs)

    def __getitem__(self, index):
        'Generates one sample of data'
        # Select sample
        ID = self.list_IDs[index]

        # Load data and get label
        X = torch.load('data/' + ID + '.pt')
        y = self.labels[ID]

        return X, y

I would like to have the ID information in the output in addition to X and y, so I changed the return to return X, y, ID, but now when I do

data_loader = data.DataLoader(dataset, args.batch_size,
                                  num_workers=args.num_workers,
                                  shuffle=True )

batch_iterator = iter(data_loader)
images, targets, ids = next(batch_iterator)

I receive an error. Does anyone know why?

All data returned by a dataset needs to be a tensor if you want to use the default collate_fn of the DataLoader. You have two options: write a custom collate function and pass it to the DataLoader, or wrap your ID inside a tensor (which is simpler, I guess) and unwrap it outside the DataLoader.
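
A minimal sketch of the custom-collate option (the function name collate_keep_ids is made up here, and default_collate is assumed to be importable from torch.utils.data, which holds for recent PyTorch versions; in older versions it lives in torch.utils.data.dataloader): batch the tensor parts with the default collate and pass the string IDs through as a plain list.

```python
import torch
from torch.utils.data import default_collate  # torch.utils.data.dataloader in older versions

def collate_keep_ids(batch):
    # batch is a list of (X, y, ID) samples from __getitem__;
    # stack the tensor/number parts with the default collate
    # and keep the string IDs as an ordinary Python list
    xs, ys, ids = zip(*batch)
    return default_collate(list(xs)), default_collate(list(ys)), list(ids)
```

You would then pass it via data.DataLoader(dataset, batch_size, collate_fn=collate_keep_ids), and the third element of each batch is a list of strings rather than a tensor.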

How can we wrap a string in a tensor? :thinking:

Ah sorry, I assumed your ID would be an integer. You cannot wrap a string in a tensor. I can think of some ways to achieve something like that, but they would not be very PyTorch-like. If you are interested in these ways, you can PM me.

It's a very weird way, but you can convert the string into integers through an ASCII table and convert them back to a string by calling a function.

Got this from the internet:

>>> s = 'hi'
>>> [ord(c) for c in s]
[104, 105]
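
Wrapped into a pair of helper functions (hypothetical names), the round trip looks like this; chr() inverts ord():

```python
def encode_id(s):
    # string -> list of Unicode code points (ints a tensor can hold)
    return [ord(c) for c in s]

def decode_id(codes):
    # list of code points -> original string
    return ''.join(chr(int(c)) for c in codes)
```

A batch of such lists can be stacked into a tensor as long as all IDs have the same length (otherwise you would need to pad them first).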

That's what I thought about too. I also thought about wrapping the loader itself, but one would have to define a new iterator for that. I proposed another method, and if it works (currently waiting for verification), I will post it here later on.

@isalirezag reported this to work great.

Sorry for coming late, but what I want to ask is: can collate_fn now return a dict in which some of the values are strings?

I think another way to do this without building a custom collate function would be to return not the ID but the index itself within the __getitem__ implementation (the index is numerical, so the default collate function can batch it). Something like:

class Dataset(data.Dataset):
    'Characterizes a dataset for PyTorch'
    def __init__(self, list_IDs, labels, return_idx: bool = False):
        'Initialization'
        self.labels = labels
        self.list_IDs = list_IDs
        self.return_idx = return_idx

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.list_IDs)

    def __getitem__(self, index):
        'Generates one sample of data'
        # Select sample
        ID = self.list_IDs[index]

        # Load data and get label
        X = torch.load('data/' + ID + '.pt')
        y = self.labels[ID]

        if self.return_idx:
            return X, y, index
        return X, y

Then you look up list_IDs externally using the batch indices:

data_loader = data.DataLoader(dataset, args.batch_size,
                                  num_workers=args.num_workers,
                                  shuffle=True )

list_IDs = dataset.list_IDs

batch_iterator = iter(data_loader)
images, targets, idx = next(batch_iterator)

# fancy indexing with an index array only works if list_IDs is a NumPy array;
# for a plain Python list, look the IDs up one by one:
ids = [list_IDs[i] for i in idx.tolist()]