Multi-label data loading bottleneck in PyTorch

I am trying to write a custom dataset and dataloader for Pascal VOC 2007. It is a multi-label classification problem. A CSV file holds the names of the images and their corresponding labels. I want to resize the images and create one-hot encoded vectors for the labels. Also, I don't want to load all the images into RAM; I want to load them only when they appear in a mini-batch.

So far I have all of this working, but on a GTX 1070 (I don't think it is relevant when loading data), an i5 8600K, and a Samsung M.2 SSD, just loading 1 epoch of data (without feeding it to the network; I only iterate over the dataloader and look) took 1 min 7 s.

Is that normal, or should it be considered slow? Or is there some optimization problem in my code?

My code:

import PIL
import torch
from PIL import Image
from torchvision.transforms import ToTensor

# map each label string to an integer index
d = {}
for e in df["labels"]:
    l = e.split(" ")
    for k in l:
        if k not in d:
            d[k] = len(d)


def create_code(df, idx):
    # build a one-hot (multi-hot) label vector for row idx, shape (1, n_classes)
    l = [0] * len(d)

    for e in df.iloc[idx]["labels"].split(" "):
        l[d[e]] = 1

    return torch.tensor(l, dtype = torch.long).view(1,-1)


def transform_image(pth):
    # load an image, resize it to 256x256 and return it as a (1, C, H, W) tensor
    img = Image.open(pth)
    return (ToTensor()(img.resize((256,256), resample = PIL.Image.ANTIALIAS))).unsqueeze(0).type(torch.float64)


class Dataset():
    def __init__(self, df, path): 
        self.df = df
        self.path = path
        
    def __len__(self): 
        return len(self.df)
    def __getitem__(self, idxs):
        if isinstance(idxs, int):
            imgs = transform_image(self.path/self.df.iloc[idxs]["fname"])
            labels = create_code(self.df, idxs)
        else:
            # build the batch sample by sample with repeated torch.cat
            sub_df = self.df.iloc[idxs]
            imgs = transform_image(self.path/sub_df.iloc[0]["fname"])
            labels = create_code(sub_df, 0)
            for i in range(1, len(idxs)):
                e = sub_df.iloc[i]["fname"]
                imgs = torch.cat((imgs, transform_image(self.path/e)))
                labels = torch.cat((labels, create_code(sub_df, i)))

        return imgs, labels

            
class DataLoader():
    def __init__(self, ds, bs): 
        self.ds, self.bs = ds, bs
    def __iter__(self):
        n = len(self.ds)
        l = torch.randperm(n)    # shuffled order of sample indices

        for i in range(0, n, self.bs): 
            idxs_l = l[i:i+self.bs]
            yield self.ds[idxs_l]

Then:

train_ds = Dataset(df_train, path/"train")
train_dl = DataLoader(train_ds, 64)

%%time
for e in train_dl:
    pass

The last part took 1 min 7 s.

I don't know the details, but after profiling I found that torch.cat was taking too much time. So I updated my code like this:

class Dataset():
    def __init__(self, df, path): 
        self.df = df
        self.path = path
        
    def __len__(self): 
        return len(self.df)
    def __getitem__(self, idxs):
        if isinstance(idxs, int):
            imgs = [transform_image(self.path/self.df.iloc[idxs]["fname"])]
            labels = [create_code(self.df, idxs)]
        else:
            sub_df = self.df.iloc[idxs]
            imgs = []
            labels = []
            for i in range(len(idxs)):
                e = sub_df.iloc[i]["fname"]
                imgs.append(transform_image(self.path/e))
                labels.append(create_code(sub_df, i))

        # build the batch with a single stack instead of repeated torch.cat
        return torch.stack(imgs), torch.stack(labels)
            

Instead of torch.cat, I append to a list and then stack the elements at the end; now iterating over 1 epoch takes 11.2 s.
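
For anyone curious why this helps, here is a rough micro-benchmark (the tensor sizes are just placeholders matching my 256x256 images and batch size of 64; the exact numbers will vary by machine) comparing growing a batch with repeated torch.cat against stacking once at the end:

import time
import torch

# placeholder samples shaped like the transformed images
samples = [torch.randn(3, 256, 256) for _ in range(64)]

# growing the batch with repeated torch.cat re-copies everything so far on every step
start = time.time()
batch = samples[0].unsqueeze(0)
for s in samples[1:]:
    batch = torch.cat((batch, s.unsqueeze(0)))
print("repeated cat :", time.time() - start)

# appending to a list and stacking once does a single allocation and copy
start = time.time()
batch = torch.stack(samples)
print("append+stack :", time.time() - start)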

The Dataset should only return one sample, and the DataLoader will stack them into one batch automatically.
I'm not sure whether you load the entire VOC dataset inside your Dataset; if so, that's the wrong way.
I think you can find the VOC dataset classes in torchvision (e.g. VOCDetection), which should be a good example.
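
A rough sketch of what I mean, reusing your df_train, path, transform_image and create_code from above (the class name and the num_workers value are just for illustration):

import torch
from torch.utils.data import Dataset, DataLoader

class VOCMultiLabelDataset(Dataset):
    """Returns a single (image, label) pair; batching is left to the DataLoader."""
    def __init__(self, df, path):
        self.df = df
        self.path = path

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # squeeze away the leading batch dim so the default collate can stack samples
        img = transform_image(self.path / self.df.iloc[idx]["fname"]).squeeze(0)
        label = create_code(self.df, idx).squeeze(0)
        return img, label

train_ds = VOCMultiLabelDataset(df_train, path / "train")
# shuffling, batching and multi-process loading come for free
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=4)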

Thank you for your reply. In the dataframe there are a file-name column and a labels column. I store that dataframe in the Dataset, and when the DataLoader wants a random subset from it (indices of the dataframe's rows in this case), the Dataset just reads the file names from the dataframe and loads only those images.

I'm sorry for the misunderstanding.
If you just want to know which part costs the most time, you could use the time module to measure it.

import time

start = time.time()
...  # the operation you want to time
print("This op cost {}".format(time.time() - start))

I ran %prun and torch.cat was taking too much time; I changed it and it works fine now. But now I am trying to make the for loop parallel so that it can be even faster. Thank you very much for your answers :+1:
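
For the parallel part, this is roughly what I am trying (load_images is just a helper name I made up; switching to the built-in torch.utils.data.DataLoader with num_workers > 0, as sketched above, would be the other option):

from concurrent.futures import ThreadPoolExecutor

import torch

# hypothetical helper: read and transform a batch of image paths with a thread
# pool so the disk reads can overlap instead of running one after another
def load_images(paths, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return torch.stack(list(pool.map(transform_image, paths)))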