Custom Dataset labeling from CSV

Kevin_Diaz_Guarneros · September 9, 2019, 6:21pm

Hi, I’m trying to start my first PyTorch project from a Kaggle dataset. The goal is simply to classify some images.

So far I’ve managed to load my own dataset with ImageFolder, but the images come without their labels.

The issue lies here: the dataset itself contains 2 folders, Train and Test.

Inside Train there are 26684 images.
Inside Test there are 3000 images.
And there’s a csv file with two columns, File name and Class. This csv matches each image with its corresponding label. (There are only 3 different classes for all the images.)

I’ve tried to find some tutorials on the internet on how to handle this, but so far all the answers I’ve found are about organizing my images into a folder structure like this:

root/dog/xxx.png
root/dog/xxy.png
root/dog/xxz.png

root/cat/123.png
root/cat/nsdf3.png
root/cat/asd932_.png

But thinking of a more real-life example, I thought that there must be a way to avoid sorting my own images into subfolders if there’s already a reference file the computer can use to do this on its own.

So I’m sure I’m missing something. Could anyone please help me with how you would do this?
I’ve published my progress on GitHub at the following link: Pneumonia_Kevin

Nikronic · September 9, 2019, 8:06pm

Hi,

You can easily define your own custom dataset class to handle this kind of situation.
Here is the top-level structure of the class you can implement:


from torch.utils.data import Dataset


class PlacesDataset(Dataset):
    def __init__(self):
        # initialize variables such as the paths to the csv file and the images, and the transforms
        ...

    def __len__(self):
        # return a single integer as the length of your dataset: in your
        # case, the number of images in your train folder or rows in the csv file
        ...

    def __getitem__(self, idx):
        # this is the most important part: read the image at position idx from the folder
        # and its label from the csv file, and return a single (image, class) pair.
        # You only need to handle 1 sample here, as if your whole dataset had a single
        # image; DataLoader calls this method once per index and stacks the results
        # into batches (in parallel worker processes if num_workers > 0).
        ...

Now, you can do whatever you wanted to do with ImageFolder with this class too.
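
To make the "one sample at a time" point concrete, here is a minimal sketch with a toy dataset. ToyDataset and its random data are made up for illustration and are not from the original project; the only point is that __getitem__ returns one (image, label) pair and DataLoader does the batching.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical dataset: eight random 1x8x8 'images' with labels 0..2."""

    def __init__(self, n=8):
        self.images = torch.rand(n, 1, 8, 8)
        self.labels = torch.arange(n) % 3

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # One sample only: a single image tensor and its integer label.
        return self.images[idx], int(self.labels[idx])


loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
images, labels = next(iter(loader))
print(images.shape)  # torch.Size([4, 1, 8, 8])
print(labels.shape)  # torch.Size([4])
```

The leading dimension of 4 is added by DataLoader, not by the dataset, so the shapes inside __getitem__ never include a batch dimension.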
I know the explanation is too abstract, but this is the whole idea. If you need real code that works, the link below is my own file, which uses a csv file to read images and generate labels on the go.

github.com/Nikronic/CoarseNet/blob/master/utils/preprocess.py

from __future__ import print_function, division
from PIL import Image
from torchvision.transforms import ToTensor, ToPILImage, Normalize, Compose
from torch.utils.data import DataLoader
import numpy as np
import random

import tarfile
import io
import os
import pandas as pd

from torch.utils.data import Dataset
import torch

from utils.Halftone.halftone import generate_halftone


class PlacesDataset(Dataset):
    def __init__(self, txt_path='filelist.txt', img_dir='data', transform=None, test=False):

This file has been truncated.
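
Since the linked file is truncated here, the following is a separate minimal sketch (not the code from that repository) of how the skeleton could be filled in for the layout described in the question. The class name CsvImageDataset, the default column names "File name" and "Class", and the assumption that the file-name column already includes the extension are all hypothetical and would need adjusting to the real csv.

```python
import os

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class CsvImageDataset(Dataset):
    """Image classification dataset whose labels come from a csv file."""

    def __init__(self, csv_path, img_dir, transform=None,
                 file_col="File name", label_col="Class"):
        # Column names are assumptions; change them to match the csv.
        self.frame = pd.read_csv(csv_path)
        self.img_dir = img_dir
        self.transform = transform
        self.file_col = file_col
        self.label_col = label_col
        # Map each class name to an integer index (sorted for stable ordering).
        self.classes = sorted(self.frame[label_col].astype(str).unique())
        self.class_to_idx = {name: i for i, name in enumerate(self.classes)}

    def __len__(self):
        # One sample per csv row.
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        path = os.path.join(self.img_dir, str(row[self.file_col]))
        # convert("RGB") yields 3 channels even for grayscale files.
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        label = self.class_to_idx[str(row[self.label_col])]
        return image, label
```

With a torchvision transform that ends in ToTensor(), an instance of this class can be passed to DataLoader in the same way as an ImageFolder dataset.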

If you have any questions, feel free to ask.

Bests
Nik

Kevin_Diaz_Guarneros · September 9, 2019, 8:33pm

Hi @Nikronic Nik,
I think I get it. I’ll try it and if I get stuck I’ll let you know. Thanks a lot for the guidance.

Kevin_Diaz_Guarneros · September 9, 2019, 10:55pm

@Nikronic Hi Nik,
I managed to create my Dataset following your instructions, along with other resources on the internet. It seems that it worked! Thanks! I’m able to see my images along with their classes.
But I’m left with a question during this process.
Before doing this I created the dataset using ImageFolder, and when visualizing the images they looked kind of blue; the tensor shape was [3, 224, 224].
But after creating my own Dataset, an error was displayed indicating that the shape [1, 224, 224] was not “appropriate” for the [3, 224, 224] specified, or something like that.
I had to alter my transforms so it would match the actual shape.
I’m not sure if I’m being completely clear, but what I noticed is that at first, with the ImageFolder dataset, the dimensions of the images were [3, 224, 224] (RGB), but with my own dataset I have [1, 224, 224]. Do you happen to know the explanation for this? I’m pretty sure I barely changed anything, nor did anything out of the ordinary.
I’m attaching my GitHub project for you to see; both behaviors can be seen in the notebook.
Thanks in advance. Kevin Project

Nikronic · September 10, 2019, 11:46am

About the first problem, the “blue thing”: when you use transforms.ToTensor(), PyTorch rescales your input from [0, 255] to [0, 1], and any Normalize transform shifts the values further. So if you want to visualize these images using matplotlib, you first need to bring the values back to a displayable range ([0, 1] for floats or [0, 255] for integers). You can use torchvision.transforms.ToPILImage() to convert a single image tensor back to a PIL image, or convert a batch of your images to numpy and plot them using matplotlib. torchvision.utils (often imported as vutils) is also a good way to visualize your images during training or testing.

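import matplotlib.pyplot as plt
import numpy as np

# obtain one batch of training images (train_loader is your DataLoader)
dataiter = iter(train_loader)
images, labels = next(dataiter)  # use the built-in next() rather than dataiter.next()
images = images.numpy()  # convert images to numpy for display

# plot the first 20 images in the batch (the batch size must be at least 20)
fig = plt.figure(figsize=(25, 4))
for idx in np.arange(20):
    # the grid dimensions must be integers, hence 20 // 2 instead of 20 / 2
    ax = fig.add_subplot(2, 20 // 2, idx + 1, xticks=[], yticks=[])
    # move channels last for matplotlib: (C, H, W) -> (H, W, C)
    plt.imshow(np.transpose(images[idx], (1, 2, 0)))
    # ax.set_title(classes[labels[idx]])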
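
If a Normalize transform was applied, the values need to be mapped back before plotting. The helper below is a small sketch of that step; the name unnormalize is hypothetical, and it assumes the same scalar mean and std (0.5 here) were used for every channel.

```python
import torch


def unnormalize(batch, mean=0.5, std=0.5):
    """Invert Normalize(mean, std) for a batch and clamp the result to [0, 1]."""
    # Normalize computes (x - mean) / std, so the inverse is x * std + mean.
    return (batch * std + mean).clamp(0.0, 1.0)


normalized = torch.tensor([-1.0, 0.0, 1.0]).view(1, 1, 1, 3)
print(unnormalize(normalized).flatten().tolist())  # [0.0, 0.5, 1.0]
```

For single-channel images, the channel axis usually has to be squeezed away before calling plt.imshow, for example plt.imshow(img.squeeze(), cmap="gray").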

About the size problem: I looked at your code and I cannot really figure out where the problem comes from. I just noticed you used transforms.Normalize for a single-channel image.

transforms.Normalize([0.5,], std=[0.5,])

The first thing that came to my mind is that all of your images are grayscale, right? When you use the PIL library, it detects the mode of your image among the possible modes such as RGB, CMYK, RGBA, L (grayscale), 1 (binary), etc. So when I saw the images at the start of the notebook, I wondered why opencv was saying they have 3 channels. Actually, I think [1, 224, 224] is the right shape for your images, but if you insist on using RGB, as you have vgg as a feature extractor, you can call .convert('RGB') on the result of Image.open(path) to force PIL to use RGB encoding (the mode argument of Image.open itself only accepts 'r').

image = Image.open(img_name + '.jpg').convert('RGB')
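
As a quick check of the channel counts, here is a small sketch using an in-memory grayscale image instead of the real files (the blank 224x224 image is only a stand-in). It may also explain the ImageFolder behavior: its default loader converts every image to RGB, which would produce 3-channel tensors even for grayscale files.

```python
from PIL import Image

gray = Image.new("L", (224, 224))  # stand-in for a grayscale image file
rgb = gray.convert("RGB")          # same pixels, replicated over 3 channels

print(gray.mode, len(gray.getbands()))  # L 1
print(rgb.mode, len(rgb.getbands()))    # RGB 3
```

The number of bands is what becomes the first dimension after ToTensor(), so the first image gives [1, 224, 224] and the second [3, 224, 224].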

A final note, which you already took care of: PyTorch uses the [batch_size, channel_size, height, width] convention, whereas numpy image arrays as used by matplotlib (I don’t know about the others!) use [batch_size, height, width, channel_size]. So be aware of it when converting between them.
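
A minimal sketch of converting between the two layouts, using a random batch as placeholder data:

```python
import torch

batch_nchw = torch.rand(4, 3, 224, 224)  # PyTorch layout: N, C, H, W

# Channels-last array for matplotlib-style code: N, H, W, C
batch_nhwc = batch_nchw.permute(0, 2, 3, 1).numpy()
print(batch_nhwc.shape)  # (4, 224, 224, 3)

# And back to the PyTorch layout
restored = torch.from_numpy(batch_nhwc).permute(0, 3, 1, 2)
print(restored.shape)  # torch.Size([4, 3, 224, 224])
print(torch.equal(restored, batch_nchw))  # True
```

Note that permute only reorders the axes; using reshape or view here instead would scramble the pixels.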

PS: It is really hard to comment on a Jupyter notebook on GitHub. If it were a .py file, I would have commented on the relevant lines directly.

Kevin_Diaz_Guarneros · September 10, 2019, 1:49pm

Thanks a lot @Nikronic, I understand. I appreciate the explanation. Have a nice day!

Nikronic · September 11, 2019, 12:07pm

You are welcome, mate. Thank you too. Good luck.

Arnaud_Mal · April 21, 2020, 4:24am

Hello,

I hope you were able to submit this Kaggle.

I am in a similar situation to yours: trying to submit to my first Kaggle competition. I have a similar problem (link) and I created a discussion for it.

I was able to create a CustomDataset that returns an image and a label (both tensors). I then pass them to the DataLoader, but when I get the image and target from the DataLoader during backpropagation, the size is not right.

This is the link to the github so I can track my progress

Any ideas, suggestions?