I am loading images as follows:
train_data = datasets.ImageFolder(data_dir, transform=transform['train'])
n = len(train_data)
lengths = [int(n * 0.6), int(n * 0.3)]
lengths.append(n - sum(lengths))  # lengths must be ints that sum to the dataset size
train_data, val_data, test_data = random_split(train_data, lengths)
train_loader = torch.utils.data.DataLoader(train_data, shuffle=True, batch_size=batch_size, num_workers=4)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(test_data, shuffle=False, batch_size=batch_size)
How can I extract image names (e.g. ‘img1.jpg’) from such a dataloader structure?
You could write a custom Dataset and return the image names in the __getitem__. This tutorial might be a good starter.
I wrote a CustomImageDataset class, and it now allows me to access image file names, but I ran into another problem - it is now much slower than when I used datasets.ImageFolder.
Could you please advise what I should change to make it faster? (And one more problem: I can’t use a big batch size, as it doesn’t fit in GPU memory despite doing del my_vars and torch.cuda.empty_cache().)
class CustomImageDataset(Dataset):
    def __init__(self, annotations_file):
        self.img_labels = annotations_file

    def __getitem__(self, idx, idx2):
        img_paths = self.img_labels['img_paths'][idx:idx2]
        all_imgs = np.empty((0, 3, img_width, img_height), float)
        for imid in img_paths:
            image = Image.open(imid)
            image = transforms.Resize((img_width, img_height), interpolation=2)(image)
            image = transforms.ToTensor()(image)
            image = image.unsqueeze(0)
            all_imgs = np.append(all_imgs, image, axis=0)
        label = torch.tensor(self.img_labels['class'][idx:idx2].values)
        all_imgs = torch.tensor(all_imgs)
        return all_imgs, label, img_paths
df = pd.DataFrame(pics, columns=['img_paths'])
# class is derived from the folder name in the path (assumed: second-to-last component)
df['class'] = df['img_paths'].apply(lambda row: row.split('/')[-2]).astype('category').cat.codes.astype(int)
classes = np.unique(df['class'])
df = shuffle(df)
train_data = df[0:(len(df) // 10) * 7]
val_data = df[len(train_data):(len(df) // 10) * 9]
test_data = df[len(train_data) + len(val_data):]
train_data = CustomImageDataset(train_data)
val_data = CustomImageDataset(val_data)
test_data = CustomImageDataset(test_data)
I don’t know how your custom Dataset works exactly, as it seems a custom sampler would be needed, since __getitem__ expects two index values? If that’s the case, how many images are you loading?
ImageFolder loads a single sample in its __getitem__, and I don’t know how you let it load two images, so what’s the baseline you are comparing against?
To get some speedup you could use tensor = torch.from_numpy(arr) to share the underlying data instead of creating a copy via tensor = torch.tensor(arr).
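A quick illustration of the difference: torch.from_numpy reuses the NumPy buffer, while torch.tensor allocates new memory and copies.

```python
import numpy as np
import torch

arr = np.zeros(3, dtype=np.float32)

shared = torch.from_numpy(arr)  # shares memory with arr, no copy
copied = torch.tensor(arr)      # allocates new memory and copies the data

arr[0] = 42.0
print(shared[0].item())  # 42.0 - the view sees the change
print(copied[0].item())  # 0.0  - the copy does not
```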
I don’t know what my_vars is, but you might need to reduce the batch size to be able to train your model. Deleting unused tensors would reduce the memory requirement, but calling empty_cache() would slow down your code without avoiding the OOM issue.
I’m using two index values in __getitem__ to create a batch:
data, labels, img_path = train_data.__getitem__(i, i + batch_size)
so that I can access the image file names; the indices are just for batch slicing.
The total number of images is around 80k (currently the WikiArt dataset, but later I will have to use a 300k+ set of images), so I’m not able to keep the whole set of images in memory, and I’m trying to find the most efficient way to load them in batches.
my_vars is just all my variables, which I delete after use (but it’s still not enough).
And it seems I’m not able to use a batch size > 80 (but when I load images with datasets.ImageFolder, without image file names, I can easily use a batch size of 300, for example).
I measured the time and it turns out the slowness comes from loss.backward() - it takes up to 20 seconds per batch now, but when I was loading images via datasets.ImageFolder it was way faster.
The loss.backward() operation isn’t sped up or slowed down by the usage of another Dataset, but depends on the model architecture as well as the input shapes.
If you’ve increased the input shapes in the new workflow, a slowdown might be expected. Otherwise, I would guess you are timing your code wrong.
Also, your __getitem__ approach would not leverage the DataLoader, which could pre-load the batches using multiprocessing. If you want to load an entire batch in each worker, use a BatchSampler.
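A sketch of that pattern: passing a BatchSampler as sampler= together with batch_size=None disables automatic batching, so the DataLoader hands a whole list of indices to __getitem__ in one call and each worker loads an entire batch. The toy dataset below is illustrative; a real version would open the images and return their paths as well.

```python
import torch
from torch.utils.data import Dataset, DataLoader, BatchSampler, SequentialSampler

class BatchDataset(Dataset):
    """Toy dataset whose __getitem__ accepts a list of indices
    and returns the whole batch at once."""
    def __init__(self, n):
        self.data = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, indices):
        # indices is a list of ints when driven by a BatchSampler
        return self.data[indices]

dataset = BatchDataset(10)
sampler = BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False)
# batch_size=None turns off automatic batching: the sampler's index lists
# are passed straight to __getitem__, one call per batch
loader = DataLoader(dataset, sampler=sampler, batch_size=None)

for batch in loader:
    print(batch.shape)  # torch.Size([4]), torch.Size([4]), torch.Size([2])
```

With num_workers > 0 this keeps the multiprocessing pre-fetching of the DataLoader while still loading whole batches, which the manual train_data.__getitem__(i, i + batch_size) calls give up.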