Class Dataset, concatenating labels with the corresponding images

Hi everyone,
I am new to PyTorch, and for the last couple of days I have been struggling with the Dataset class that lets you build your own custom dataset.

I am working with this dataset (https://www.kaggle.com/ianmoone0617/flower-goggle-tpu-classification/kernels). The problem is that it has the images and their labels in separate folders, and I can't figure out how to pair them up.

I found this notebook (it's not mine) with code that implements the Dataset class, but I simply cannot understand it. The part I do not understand is the one where it iterates over all the files. Could you be so kind as to help me understand it? Thank you so much in advance! :sweat_smile:

import os
import random

import numpy as np
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset
from tqdm import tqdm


class MyDataset(Dataset):

    def __init__(self, image_dir, label_dir, transform=None):
        _images, _labels = [], []
        # Total number of samples in the dataset
        _number = 0
        # Read the CSV file that maps directory paths to labels
        label_df = pd.read_csv(label_dir)

        # Walk through every file under image_dir, including the .jpg images
        for subdir, dirs, files in tqdm(os.walk(image_dir)):
            for filename in files:
                if len(subdir.split(os.sep)) > 5:
                    # Note: if files cannot be read, the problem is almost certainly here; check the path
                    corr_label = label_df[label_df['dirpath'] == os.sep.join(subdir.split(os.sep)[5:])]['label'].values
                    if corr_label.size != 0 and filename.endswith('jpg'):
                        _images.append(subdir + os.sep + filename)
                        _labels.append(corr_label)
                        _number += 1
        # Shuffle the (image, label) pairs together so they stay aligned
        mapIndexPosition = list(zip(_images, _labels))
        random.shuffle(mapIndexPosition)
        _images, _labels = zip(*mapIndexPosition)

        # Keep indexable sequences so __getitem__ can return the requested sample
        self._image = list(_images)
        self._labels = list(_labels)
        self._number = _number
        self._category = label_df['label'].nunique()
        self.transform = transform

    def __len__(self):
        return self._number

    def __getitem__(self, index):
        img = self._image[index]
        lab = self._labels[index]

        img = self._loadimage(img)
        if self.transform:
            img = self.transform(img)
        return img, lab

    def _categorical(self, label):
        # One-hot encode an array of integer labels
        return np.arange(self._category) == label[:, None]

    def _loadimage(self, file):
        return Image.open(file).convert('RGB')

    def get_categorical_nums(self):
        return self._category

I had a similar problem.
This video will definitely help you understand how to write your own custom dataset class
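As for the part that iterates over all the files: the idea is to walk the image folders with os.walk and look up each folder's label in the CSV. Here is a minimal sketch of that idea, not the notebook's exact code. The helper name collect_pairs, the folder names, and the tiny throwaway directory tree are made up for illustration; the dirpath and label column names are taken from the notebook. Note that it uses os.path.relpath in place of the notebook's hard-coded subdir.split(os.sep)[5:], which only works when the images sit exactly five path components deep.

```python
import os
import tempfile

import pandas as pd


def collect_pairs(image_dir, label_df):
    """Walk image_dir and pair each .jpg path with the label of its folder."""
    pairs = []
    for subdir, _dirs, files in os.walk(image_dir):
        # Folder path relative to image_dir, e.g. "rose"
        rel = os.path.relpath(subdir, image_dir)
        match = label_df.loc[label_df["dirpath"] == rel, "label"].values
        if match.size == 0:
            continue  # folder is not listed in the CSV, so skip it
        for filename in sorted(files):
            if filename.endswith(".jpg"):
                pairs.append((os.path.join(subdir, filename), int(match[0])))
    return pairs


# Hypothetical directory tree with empty files, just to try the function
root = tempfile.mkdtemp()
tree = {"rose": ["a.jpg", "b.jpg"], "tulip": ["c.jpg", "notes.txt"]}
for folder, names in tree.items():
    os.makedirs(os.path.join(root, folder))
    for name in names:
        open(os.path.join(root, folder, name), "w").close()

label_df = pd.DataFrame({"dirpath": ["rose", "tulip"], "label": [0, 1]})
pairs = collect_pairs(root, label_df)
for path, label in pairs:
    print(os.path.relpath(path, root), label)
```

Each entry in pairs is an (image path, label) tuple, which is what a Dataset would then index in __getitem__.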

Feel free to ask if you still don’t get it.

Regards!

Hi bolt25, thank you so much for your fast reply!
I tried to implement the model, but when I iterate over the DataLoader it gives me back this error: TypeError: join() argument must be str or bytes, not 'int64'.
I feel that the mistake is that I cannot properly attach the labels to the images. I would be so grateful to understand this, thanks for your support!
Here is the code I used:
class MyDataset(Dataset):

    def __init__(self, csv_file, root_dir, transform=None):
        self.labels = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        if torch.is_tensor(index):
            index = index.tolist()

        image_name = os.path.join(self.root_dir, self.labels.iloc[index, 0])
        image = io.imread(image_name)

        if self.transform:
            image = self.transform(image)

        return (image, labels)

ds = MyDataset(csv_file="…/input/flower-goggle-tpu-classification/flowers_idx.csv", root_dir="…/input/flower-goggle-tpu-classification/flower_tpu/flower_tpu/flowers_google/flowers_google", transform=transforms.ToTensor())

This is the error:

TypeError                                 Traceback (most recent call last)
in <module>
----> 1 for image, label in data_dl:
      2     print(image)
      3     print(label)

/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    343
    344     def __next__(self):
--> 345         data = self._next_data()
    346         self._num_yielded += 1
    347         if self._dataset_kind == _DatasetKind.Iterable and \

/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    383     def _next_data(self):
    384         index = self._next_index()  # may raise StopIteration
--> 385         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    386         if self._pin_memory:
    387             data = _utils.pin_memory.pin_memory(data)

/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

in __getitem__(self, index)
     13             index = index.tolist()
     14
---> 15         image_name = os.path.join(self.root_dir, self.labels.iloc[index, 0])
     16         image = io.imread(image_name)
     17

/opt/conda/lib/python3.7/posixpath.py in join(a, *p)
     92                 path += sep + b
     93     except (TypeError, AttributeError, BytesWarning):
---> 94         genericpath._check_arg_types('join', a, *p)
     95         raise
     96     return path

/opt/conda/lib/python3.7/genericpath.py in _check_arg_types(funcname, *args)
    151             else:
    152                 raise TypeError('%s() argument must be str or bytes, not %r' %
--> 153                                 (funcname, s.__class__.__name__)) from None
    154     if hasstr and hasbytes:
    155         raise TypeError("Can't mix strings and bytes in path components") from None

TypeError: join() argument must be str or bytes, not 'int64'

Can you share a gist of what's in the CSV?
What kind of labels does it have?
Also, I think once you clear this error, this code will raise one more error on the return line of your __getitem__ function.

Yep this is the inside of the csv:


I think that the problem is in the line image_name = os.path.join(self.root_dir, self.labels.iloc[index, 0])
but I really do not know why :pensive:

P.S. Yeah, I already noticed the error in the return of the __getitem__ function, but thanks!

Also in case it is useful to you, this is the structure of the various folders:
and the images are located in the last folder flowers_google

Change the following line:
image_name = os.path.join(self.root_dir, self.labels.iloc[index, 0])

To:
image_name = os.path.join(self.root_dir, str(self.labels.iloc[index, 0]))

And your label would be:
labels = self.labels.iloc[index, 1]
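To see why the str() call matters, here is a small sketch that reproduces the error outside the Dataset. The DataFrame contents and the "root" folder are made up (I haven't seen your actual CSV); the point is only that a numeric first column comes back from iloc as a NumPy integer, which os.path.join refuses.

```python
import os

import pandas as pd

# Hypothetical CSV contents: a numeric id column and a numeric label column
labels = pd.DataFrame({"id": [0, 1, 2], "label": [7, 3, 7]})

value = labels.iloc[0, 0]
print(type(value).__name__)  # a NumPy integer type, not str

raised = False
try:
    os.path.join("root", value)  # same kind of call as in __getitem__
except TypeError as err:
    raised = True
    print("TypeError:", err)

# Converting to str first gives os.path.join something it accepts
image_name = os.path.join("root", str(value))
label = labels.iloc[0, 1]
print(image_name, label)
```

If the real file names carry an extension that the CSV does not store, that would still need to be appended to the string.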

I hope this works out for you!

Feel free to text back!
Regards

Hi Dharmik! I just wanted to let you know that thanks to your advice I was able to build the dataset class. Thank you so much, I really appreciate it!
In the end I was able to attach each image to its label by its id; then I only had to add the file extension at the end of each image_name, and the class was able to match them up!
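Roughly, the pairing step looks like the sketch below. This is not my exact code: the class name FlowerIndex, the column names id and label, and the ".jpeg" extension are placeholders for illustration. It also leaves out the torch Dataset base class and the image loading, so it runs without the actual files; in the real class, __getitem__ would load the image from image_name and apply the transform.

```python
import os

import pandas as pd


class FlowerIndex:
    """Pairs each image path with its label via the id column of the CSV."""

    def __init__(self, labels_df, root_dir, extension=".jpeg"):
        self.labels = labels_df.reset_index(drop=True)
        self.root_dir = root_dir
        self.extension = extension  # assumed extension, adjust to the real files

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        row = self.labels.iloc[index]
        # The id is numeric, so convert it to str before building the file name
        image_name = os.path.join(self.root_dir, str(row["id"]) + self.extension)
        label = int(row["label"])
        return image_name, label


# Hypothetical CSV contents
df = pd.DataFrame({"id": [10, 11, 12], "label": [7, 3, 7]})
index = FlowerIndex(df, "imgs")
print(len(index), index[1])
```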
Have a happy day! :grin:

Congrats brother!
Happy to help!
Just make sure you click on the solution box to make my answer the solution to this thread! Maybe it’ll help others!
