How to load images from a folder when the image names are stored in a CSV file

I have a CSV file that holds the names of the specific images that have to be loaded into a DataLoader from a folder. The same CSV file also contains the labels/classes of the images. I tried writing a Dataset class; this is my code:

import os
import pandas as pd
from skimage.io import imread   # imread could also come from matplotlib.pyplot or cv2
from torch.utils.data import Dataset

class data(Dataset):
  def __init__(self, type='train', transform=None):
    # traincsv / testcsv are pandas DataFrames read from the CSV files elsewhere,
    # and path is the folder that contains the .jpg images
    if type == 'train':
      self.imagecsv = traincsv
    else:
      self.imagecsv = testcsv
    imagelabels = self.imagecsv[['image_id', 'Labels']].copy()
    # load every image listed in the CSV into memory up front
    images = []
    for i in imagelabels['image_id']:
      image = imread(os.path.join(path, i + '.jpg'))
      images.append(image)
    imagelabels['images'] = images
    self.imagelabels = imagelabels.drop(columns='image_id')

  def __getitem__(self, idx):
    return self.imagelabels['images'][idx], self.imagelabels['Labels'][idx]
   

When I try to run this code, the system crashes every time. I am running it on Google Colab with a GPU. I know I haven't added an index parameter in the __getitem__ function; that is because I need all the images whose names are specified in the training CSV file. Is there a cleaner implementation which won't crash my system? I also tried using Image.open, but the system crashed again.

Every time you index or retrieve an item from your data class, your code processes all the images again, which I think would be a mistake while training.

Shift the data-retrieving part into __init__, and then just return what you have to return in __getitem__. And shouldn't it be def __getitem__(self, index)?

I assume it is okay to drop the index parameter from the __getitem__ function. I haven't included it because I wanted the whole dataset every time. Anyway, thank you for your answer! I will try it.

I tried this method; I'll update the code. However, my system still tends to crash. Any other implementation ideas?

Hi,

I am not sure about your code, but there are easier ways to handle it. You can define your own custom Dataset class and get all the data in each iteration simply by setting batch_size=len(dataset) (shown after the loader example below), so you no longer have to change your code for different purposes. You also have to implement the __len__ function.

Here is a conventional implementation of a custom Dataset class:
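Below is a minimal sketch of what such a class might look like; the PlacesDataset name and the txt_path / img_dir / transform parameters come from the usage example further down, while the file layout (one image name and label per line) and the .jpg extension are assumptions:

import os
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset

class PlacesDataset(Dataset):
    def __init__(self, txt_path, img_dir, transform=None):
        # read the file listing the image names and labels
        # (assumed format: "image_name label" per line)
        df = pd.read_csv(txt_path, sep=' ', header=None, names=['image_id', 'label'])
        self.image_ids = df['image_id'].tolist()
        self.labels = df['label'].tolist()
        self.img_dir = img_dir
        self.transform = transform

    def __len__(self):
        # required so the DataLoader knows how many samples the dataset has
        return len(self.image_ids)

    def __getitem__(self, index):
        # load a single image lazily here instead of loading everything in __init__
        img_path = os.path.join(self.img_dir, self.image_ids[index] + '.jpg')
        image = Image.open(img_path).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
        label = torch.tensor(self.labels[index])
        return image, label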

After creating such a Dataset, you can use it for different purposes like this:

# args here is presumably an argparse.Namespace holding the CLI options,
# and pin_memory a bool defined elsewhere
train_dataset = PlacesDataset(txt_path=args.txt,
                              img_dir=args.img,
                              transform=custom_transforms)

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=args.bs,
                          shuffle=True,
                          num_workers=args.nw,
                          pin_memory=pin_memory)
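
If, as mentioned above, you want every sample in a single iteration, the same dataset can be wrapped in a loader whose batch size is the full dataset length (a minimal sketch under the same assumptions):

from torch.utils.data import DataLoader

# one batch that contains the entire dataset
full_loader = DataLoader(dataset=train_dataset,
                         batch_size=len(train_dataset),
                         shuffle=False)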

Best

Thank you so much for helping me out! I followed the code in the link you provided, and I must say a lot of concepts are much clearer for me now. This was very helpful. :smiley:

You are welcome.
I think the best way to learn about best practices is to read well-constructed code, which in this case means the models implemented in PyTorch itself. For instance, ResNet, VGG, etc. have been implemented by the PyTorch developers in the most general and straightforward way, so by reading that code we can adopt the same patterns and write more reliable code.

Good luck

Thanks for the advice! I will check out that code to get a better understanding.
