I am loading images as follows:
train_data = datasets.ImageFolder(data_dir, transform=transform['train'])
n = len(train_data)
lengths = [int(n * 0.6), int(n * 0.3)]
lengths.append(n - sum(lengths))  # lengths must be ints that sum to the dataset size
train_data, val_data, test_data = random_split(train_data, lengths)
train_loader = torch.utils.data.DataLoader(train_data, shuffle=True, batch_size=batch_size, num_workers=4)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(test_data, shuffle=False, batch_size=batch_size)
How can I extract image names (e.g. ‘img1.jpg’) from such a dataloader structure?
You could write a custom Dataset and return the image names in the __getitem__. This tutorial might be a good starter.
I wrote a CustomImageDataset class, and it now allows me to access image file names, but I ran into another problem - it is now much slower than when I used datasets.ImageFolder.
Could you please advise what I should change to make it faster? (And one more problem: I can’t use a big batch size, as it doesn’t fit in GPU memory despite doing del my_vars and torch.cuda.empty_cache().)
class CustomImageDataset(Dataset):
    def __init__(self, annotations_file):
        self.img_labels = annotations_file

    def __getitem__(self, idx, idx2):
        img_paths = self.img_labels['img_paths'][idx:idx2]
        all_imgs = np.empty((0, 3, img_width, img_height), float)
        for imid in img_paths:
            image = Image.open(imid)
            image = transforms.Resize((img_width, img_height), interpolation=2)(image)
            image = transforms.ToTensor()(image)
            image = image.unsqueeze(0)
            all_imgs = np.append(all_imgs, image, axis=0)
        label = torch.tensor(self.img_labels['class'][idx:idx2].values)
        all_imgs = torch.tensor(all_imgs)
        return all_imgs, label, img_paths
df = pd.DataFrame(pics, columns=['img_paths'])
# class is derived from the folder name in the path (assumed: second-to-last component)
df['class'] = df['img_paths'].apply(lambda row: row.split('/')[-2]).astype('category').cat.codes.astype(int)
classes = np.unique(df['class'])
df = shuffle(df)
train_data = df[0:(len(df) // 10) * 7]
val_data = df[len(train_data):(len(df) // 10) * 9]
test_data = df[len(train_data) + len(val_data):]
train_data = CustomImageDataset(train_data)
val_data = CustomImageDataset(val_data)
test_data = CustomImageDataset(test_data)
I don’t know how your custom Dataset works exactly, as it seems a custom sampler would be needed, since __getitem__ expects two index values? If that’s the case, how many images are you loading?
ImageFolder loads a single sample in its __getitem__, and I don’t know how you let it load two images, so what’s the baseline you are comparing against?
To get some speedup you could use tensor = torch.from_numpy(arr) to share the underlying data instead of creating a copy via tensor = torch.tensor(arr).
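A quick illustration of the difference: torch.from_numpy reuses the NumPy buffer, while torch.tensor allocates new memory and copies.

```python
import numpy as np
import torch

arr = np.zeros(3, dtype=np.float32)

shared = torch.from_numpy(arr)  # shares memory with arr, no copy
copied = torch.tensor(arr)      # allocates new memory and copies the data

arr[0] = 42.0
print(shared[0].item())  # 42.0 - the view sees the change
print(copied[0].item())  # 0.0  - the copy does not
```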
I don’t know what my_vars is, but you might need to reduce the batch size to be able to train your model. Deleting unused tensors would reduce the memory requirement, but calling empty_cache() would slow down your code without avoiding the OOM issue.
I’m using two index values in __getitem__ to create a batch:
data, labels, img_path = train_data.__getitem__(i, i + batch_size)
so that I can access the image file names; the indices are just for batch slicing.
The total number of images is around 80k (currently the WikiArt dataset, but later I will have to use a 300k+ set of images), so I’m not able to keep the whole set of images in memory, and I’m trying to find the most efficient way to load them in batches.
my_vars is just all my variables, which I delete after use (but it’s still not enough).
And it seems I’m not able to use a batch size > 80 (but when I load images with datasets.ImageFolder, without image file names, I can easily use a batch size of 300, for example).
I measured the time and it turns out the slowness comes from loss.backward() - it takes up to 20 seconds per batch now, but when I was loading images via datasets.ImageFolder it was way faster.
The loss.backward() operation isn’t sped up or slowed down by the usage of another Dataset, but depends on the model architecture as well as the input shapes.
If you’ve increased the input shapes in the new workflow, a slowdown might be expected. Otherwise, I would guess you are timing your code wrong.
Also, your __getitem__ approach would not leverage the DataLoader, which could pre-load the batches using multiprocessing. If you want to load an entire batch in each worker, use a BatchSampler.
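A sketch of that pattern: passing a BatchSampler as sampler= together with batch_size=None disables automatic batching, so the DataLoader hands a whole list of indices to __getitem__ in one call and each worker loads an entire batch. The toy dataset below is illustrative; a real version would open the images and return their paths as well.

```python
import torch
from torch.utils.data import Dataset, DataLoader, BatchSampler, SequentialSampler

class BatchDataset(Dataset):
    """Toy dataset whose __getitem__ accepts a list of indices
    and returns the whole batch at once."""
    def __init__(self, n):
        self.data = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, indices):
        # indices is a list of ints when driven by a BatchSampler
        return self.data[indices]

dataset = BatchDataset(10)
sampler = BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False)
# batch_size=None turns off automatic batching: the sampler's index lists
# are passed straight to __getitem__, one call per batch
loader = DataLoader(dataset, sampler=sampler, batch_size=None)

for batch in loader:
    print(batch.shape)  # torch.Size([4]), torch.Size([4]), torch.Size([2])
```

With num_workers > 0 this keeps the multiprocessing pre-fetching of the DataLoader while still loading whole batches, which the manual train_data.__getitem__(i, i + batch_size) calls give up.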