So the format of a custom dataset should be like the following:
import torch
from torch.utils import data

class Dataset(data.Dataset):
    'Characterizes a dataset for PyTorch'
    def __init__(self, list_IDs, labels):
        'Initialization'
        self.labels = labels
        self.list_IDs = list_IDs

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.list_IDs)

    def __getitem__(self, index):
        'Generates one sample of data'
        # Select sample
        ID = self.list_IDs[index]

        # Load data and get label
        X = torch.load('data/' + ID + '.pt')
        y = self.labels[ID]

        return X, y
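For reference, using this class is straightforward (the IDs and labels below are made-up placeholders, and the corresponding 'data/<ID>.pt' files are assumed to exist):

# Hypothetical IDs and labels; 'data/<ID>.pt' files assumed to exist
list_IDs = ['id-0', 'id-1', 'id-2']
labels = {'id-0': 0, 'id-1': 1, 'id-2': 0}

training_set = Dataset(list_IDs, labels)
training_loader = data.DataLoader(training_set, batch_size=2, shuffle=True)

for X, y in training_loader:
    pass  # X and y arrive already batched by the default collate_fn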
I would like to have the ID information in the output in addition to X and y, so I did return X, y, ID instead, but now the default DataLoader setup no longer works for me.
All data returned by a dataset needs to be a tensor if you want to use the default collate_fn of the DataLoader. You have two options: write a custom collate function and pass it to the DataLoader, or wrap your ID inside a tensor (which is simpler, I guess) and unwrap it outside the DataLoader.
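If the ID happens to be an integer, the second option could look roughly like this (just a sketch; DatasetWithID is a made-up name, not code from this thread):

class DatasetWithID(Dataset):
    'Same dataset as above, but also returns the (integer) sample ID'
    def __getitem__(self, index):
        ID = self.list_IDs[index]
        X = torch.load('data/' + str(ID) + '.pt')
        y = self.labels[ID]
        # Wrap the integer ID in a tensor so the default collate_fn can batch it
        return X, y, torch.tensor(ID)

# Outside the DataLoader the IDs can be unwrapped again:
# for X, y, id_batch in loader:
#     ids = id_batch.tolist()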
Ah sorry, I assumed your ID would be an integer. You cannot wrap a string in a tensor. I could think of some ways to achieve something like that, but they would not be very PyTorch-like. If you are interested in these ways you can PM me.
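As a side note, the custom collate_fn route from the earlier reply also works with string IDs, since the collate function can simply pass them through as a plain Python list; a minimal sketch (collate_with_string_ids is a made-up name):

from torch.utils.data.dataloader import default_collate

def collate_with_string_ids(batch):
    'Batches X and y with the default logic and keeps the string IDs as a list'
    Xs, ys, IDs = zip(*batch)
    return default_collate(list(Xs)), default_collate(list(ys)), list(IDs)

# loader = data.DataLoader(dataset, batch_size=4, collate_fn=collate_with_string_ids)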
That's what I thought about too. I also thought about wrapping the loader itself, but one would have to define a new iterator for this. I proposed another method, and if it works (currently waiting for verification), I will post it here later on.
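In case it helps, wrapping the loader with its own iterator could look roughly like this, assuming the dataset's __getitem__ returns the integer index as a third element (names are made up, this is only a sketch of the idea):

class LoaderWithIDs:
    'Wraps a DataLoader whose dataset returns (X, y, index)'
    def __init__(self, loader, list_IDs):
        self.loader = loader
        self.list_IDs = list_IDs

    def __len__(self):
        return len(self.loader)

    def __iter__(self):
        for X, y, idx in self.loader:
            # Map the batched integer indices back to the original string IDs
            IDs = [self.list_IDs[i] for i in idx.tolist()]
            yield X, y, IDs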