Custom Dataset labeling from CSV

Hi, I’m trying to start my first PyTorch project using a Kaggle dataset. The goal is simply to classify some images.

So far I’ve managed to load my own dataset with ImageFolder, but it lacks the labels for the images.

The issue lies here: the dataset itself contains two folders, Train and Test.

  • Inside Train there are 26684 images.
  • Inside Test there are 3000 images.
  • And there’s a csv file with two columns, File name and Class, which matches each image with its corresponding label. (There are only 3 different classes for all the images.)

I’ve tried to find tutorials online on how to handle this, but so far every answer I’ve found suggests organizing my images into a folder structure like this:

root/dog/xxx.png
root/dog/xxy.png
root/dog/xxz.png

root/cat/123.png
root/cat/nsdf3.png
root/cat/asd932_.png

But thinking of a more real-life scenario, I figured there must be a way to avoid sorting my images into subfolders by hand when there’s already a reference file the computer could use to do this on its own.

So I’m sure I’m missing something. Could anyone please help me with how to do this?
I’ve published my progress on GitHub at the following link: Pneumonia_Kevin

Hi,

You can easily define your own custom dataset class to handle this kind of situation.
Here is the top-level structure of the class you can implement:


import os
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class PlacesDataset(Dataset):
  def __init__(self, csv_file, img_dir, transform=None):
    # initialize variables such as the paths to the csv file and images, and the transforms
    self.labels = pd.read_csv(csv_file)  # assumption: column 0 = full file name, column 1 = class
    self.img_dir = img_dir
    self.transform = transform
  def __len__(self):
    # return a single integer: the number of lines in the csv file (one per training image)
    return len(self.labels)
  def __getitem__(self, idx):
    # the most important part: read ONE image from the folder and its label from the csv,
    # and return a single (image, class) pair. Think of your dataset as having only 1 image;
    # DataLoader takes care of batching and of loading samples in parallel.
    image = Image.open(os.path.join(self.img_dir, self.labels.iloc[idx, 0]))
    if self.transform:
      image = self.transform(image)
    return image, self.labels.iloc[idx, 1]
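
One detail the class structure leaves open is what the class value should look like. If the Class column holds names rather than numbers, a loss such as cross entropy will want integer indices instead. Here is a minimal sketch of that lookup using only the standard library. The column names come from the question; the file names and class names are made up for illustration.

```python
import csv
import io

# Hypothetical CSV contents: the column names follow the question
# ("File name", "Class"); the file names and class names are invented.
SAMPLE_CSV = """File name,Class
img_001.jpeg,normal
img_002.jpeg,bacteria
img_003.jpeg,virus
img_004.jpeg,normal
"""


def load_labels(csv_stream):
    """Return (samples, class_to_idx) from a CSV with File name / Class columns."""
    rows = list(csv.DictReader(csv_stream))
    # Sort the class names so each class gets the same integer on every run.
    classes = sorted({row["Class"] for row in rows})
    class_to_idx = {name: i for i, name in enumerate(classes)}
    # One (file name, integer label) pair per image.
    samples = [(row["File name"], class_to_idx[row["Class"]]) for row in rows]
    return samples, class_to_idx


samples, class_to_idx = load_labels(io.StringIO(SAMPLE_CSV))
print(class_to_idx)
print(samples[0])
```

With a real file you would pass open(path, newline="") instead of the StringIO object, and __getitem__ would then index into samples.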

Now you can do everything you wanted to do with ImageFolder with this class too.
I know the explanation is abstract, but this is the whole idea. If you need real working code, the link below is mine; it uses a csv file to read images and generate labels on the fly.

If you had any questions, feel free to ask.

Bests
Nik

Hi @Nikronic Nik,
I think I get it. I’ll try it and if I get stuck I’ll let you know. Thanks a lot for the guidance. :grinning:

@Nikronic Hi Nik,
I managed to create my Dataset following your instructions along with other resources online. It seems to work! Thanks! I’m able to see my images along with their classes.
But I’m left with one question from this process.
Before this, I created the dataset using ImageFolder, and when visualizing the images they looked kind of blue; the tensor shape was [3, 224, 224].
But after creating my own Dataset, an error was displayed saying that the shape [1, 224, 224] was not “appropriate” for the specified [3, 224, 224], or something like that.
I had to alter my transforms to match the actual shape.
I’m not sure if I’m being completely clear, but what I noticed is that with the ImageFolder dataset the images had shape [3, 224, 224] (RGB), whereas with my own dataset they are [1, 224, 224]. Do you happen to know why? I’m pretty sure I barely changed anything, nor did I do anything out of the ordinary.
I’m attaching my GitHub project for you to see; both behaviors are visible in the notebook.
Thanks in advance. Kevin Projct

About the first problem, the “blue thing”: when you use transforms.ToTensor(), PyTorch rescales your input from [0, 255] to [0, 1]. So if you want to visualize these images using matplotlib, you first need to bring the values back into a range it can display ([0, 1] for floats or [0, 255] for integers), which also means undoing any transforms.Normalize you applied. You can take a batch of your images, convert single images to PIL with transforms.ToPILImage() or convert the batch to numpy, and plot with matplotlib. torchvision.utils (often imported as vutils) is also a good way to visualize your images during training or testing.

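import matplotlib.pyplot as plt
import numpy as np

# obtain one batch of training images (train_loader is your DataLoader)
dataiter = iter(train_loader)
images, labels = next(dataiter)  # use the built-in next(); the iterator has no .next() method
images = images.numpy() # convert images to numpy for display

# plot the images in the batch, along with the corresponding labels
# (assumes the batch holds at least 20 three-channel images)
fig = plt.figure(figsize=(25, 4))
for idx in np.arange(20):
    ax = fig.add_subplot(2, 20 // 2, idx + 1, xticks=[], yticks=[])  # grid sizes must be integers
    ax.imshow(np.transpose(images[idx], (1, 2, 0)))  # [C, H, W] -> [H, W, C]
    #ax.set_title(classes[labels[idx]])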
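
If the transforms include Normalize with mean 0.5 and std 0.5, as in the notebook discussed here, the values end up in [-1, 1] and have to be mapped back before plotting. The helper below is only a sketch of that step: to_displayable is a hypothetical name, and the fake array stands in for one sample from the dataset.

```python
import numpy as np


def to_displayable(image_chw, mean=0.5, std=0.5):
    """Undo Normalize(mean, std) and reorder [C, H, W] -> [H, W, C] for plotting."""
    image = image_chw * std + mean          # back to the [0, 1] range
    image = np.clip(image, 0.0, 1.0)        # guard against small overshoots
    image = np.transpose(image, (1, 2, 0))  # channels last
    if image.shape[-1] == 1:
        image = image[..., 0]               # single channel: return a 2-D array
    return image


# Fake normalized grayscale sample with values in [-1, 1]
fake = np.linspace(-1.0, 1.0, 16, dtype=np.float32).reshape(1, 4, 4)
shown = to_displayable(fake)
print(shown.shape, float(shown.min()), float(shown.max()))
```

For a 2-D result, pass cmap='gray' to imshow; otherwise matplotlib applies its default colormap to the single channel.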

About the size problem: I looked at your code and could not really figure out where the problem comes from. I just noticed that you used transforms.Normalize for a single-channel image:

transforms.Normalize([0.5,], std=[0.5,])

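The first thing that came to mind is that all of your images are grayscale, right? When you open a file with the PIL library, it detects the image mode from the file, for example RGB, CMYK, RGBA, L (grayscale) or 1 (binary), so a grayscale file is loaded with a single channel. ImageFolder’s default loader, on the other hand, converts every image to RGB after opening it, which is why you got [3, 224, 224] there. So when I saw the images at the start of the notebook, I wondered why OpenCV was reporting 3 channels (cv2.imread decodes to 3 channels by default). Actually, I think [1, 224, 224] is the right shape for your images, but if you insist on using RGB, since you have VGG as a feature extractor, you can call .convert('RGB') on the opened image to force PIL to use RGB encoding. (The mode argument of Image.open only accepts 'r', so it cannot be used for this.)

image = Image.open(img_name+'.jpg').convert('RGB')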
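
One way to get three channels out of a grayscale file is PIL's convert method. Here is a small sketch of what it does, using an in-memory image in place of one of the dataset files (the size and pixel value are arbitrary):

```python
from PIL import Image

# A tiny in-memory grayscale image standing in for one of the dataset files
gray = Image.new("L", (4, 4), color=128)

# Replicate the single channel into R, G and B
rgb = gray.convert("RGB")

print(gray.mode, gray.getbands())
print(rgb.mode, rgb.getbands())
print(rgb.getpixel((0, 0)))
```

After ToTensor, the first image would give a tensor with 1 channel and the second a tensor with 3 channels, so the Normalize mean and std lists need to match whichever you choose.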

A final note, which you have already taken care of: PyTorch uses the [batch_size, channel_size, height, width] convention, whereas numpy-based tools such as matplotlib (I don’t know about the others!) expect the channels-last [batch_size, height, width, channel_size] layout. So be aware of this when converting between them.
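
As a quick illustration of switching between the two layouts, here is a sketch with numpy on a fake batch (the batch size of 8 is arbitrary):

```python
import numpy as np

# Fake batch in PyTorch's layout: [batch_size, channel_size, height, width]
batch_nchw = np.zeros((8, 3, 224, 224), dtype=np.float32)

# Channels-last layout for plotting: [batch_size, height, width, channel_size]
batch_nhwc = np.transpose(batch_nchw, (0, 2, 3, 1))

# And back to the PyTorch layout
round_trip = np.transpose(batch_nhwc, (0, 3, 1, 2))

print(batch_nhwc.shape)
print(round_trip.shape)
```

On a tensor, the same reordering can be done with tensor.permute(0, 2, 3, 1) before calling .numpy().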

PS: It is really hard to comment on a Jupyter notebook on GitHub. If it were a .py file, I would have commented on the specific lines.

Thanks a lot @Nikronic, I understand. I appreciate the explanation. Have a nice day! :slight_smile:

You are welcome mate. Thank you too. Good luck. :smiley: