Class Dataset, concatenating labels with the corresponding images

michelepy · June 17, 2020, 8:41am

Hi everyone,
I am new to PyTorch, and for the last couple of days I have been struggling with the Dataset class, which lets you build your own custom dataset.

I am working with this dataset (https://www.kaggle.com/ianmoone0617/flower-goggle-tpu-classification/kernels). The problem is that it stores the images and their labels in separate folders, and I can't figure out how to pair them up.

I found this notebook (it's not mine) that contains code implementing the Dataset class, but I simply cannot understand it. The part I do not understand is the one where it iterates over all the files. Could you please help me understand it? Thank you so much in advance!

class MyDataset(Dataset):

    def __init__(self, image_dir, label_dir, transform=None):
        _images, _labels = [], []
        # total amount of dataset
        _number = 0
        # Reading the categorical file
        label_df = pd.read_csv(label_dir)

        # Iterate all files including .jpg images
        for subdir, dirs, files in tqdm(os.walk(image_dir)):
            for filename in files:
                if len(subdir.split(os.sep)) > 5:
                    # Note: if the files cannot be read, the problem is almost certainly here; check the path
                    corr_label = label_df[label_df['dirpath'] == os.sep.join(subdir.split(os.sep)[5:])]['label'].values
                    if corr_label.size != 0 and filename.endswith(('jpg')):
                        _images.append(subdir + os.sep + filename)
                        _labels.append(corr_label)
                        _number += 1
        # Randomly arrange data pairs
        mapIndexPosition = list(zip(_images, _labels))
        random.shuffle(mapIndexPosition)
        _images, _labels = zip(*mapIndexPosition)

        self._image = iter(_images)
        self._labels = iter(_labels)
        self._number = _number
        self._category = label_df['label'].nunique()
        self.transform = transform

    def __len__(self):
        return self._number

    def __getitem__(self, index):
        img = next(self._image)
        lab = next(self._labels)

        img = self._loadimage(img)
        if self.transform:
            img = self.transform(img)
        return img, lab

    def _categorical(self, label):
        return np.arange(self._category) == label[:, None]

    def _loadimage(self, file):
        return Image.open(file).convert('RGB')

    def get_categorical_nums(self):
        return self._category
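
One possible reading of the file-iteration loop above, as a dependency-free sketch: `os.walk` visits every folder under the image root, the folder's path relative to that root is looked up in the label table, and every `.jpg` in a labelled folder is stored together with that label. The names `collect_pairs` and `label_by_dir` are hypothetical. The sketch also swaps two things: it uses `os.path.relpath` in place of the hard-coded `split(os.sep)[5:]` slice, and a plain dict in place of the `dirpath`/`label` DataFrame lookup.

```python
import os


def collect_pairs(image_dir, label_by_dir):
    """Return (image_path, label) pairs for every .jpg under image_dir.

    label_by_dir is a hypothetical dict mapping a folder path (relative to
    image_dir) to its label, standing in for the 'dirpath'/'label' columns.
    """
    pairs = []
    # os.walk yields one (folder, subfolders, files) triple per folder in the tree
    for subdir, _dirs, files in os.walk(image_dir):
        # Path of this folder relative to the root, e.g. "rose"
        rel_dir = os.path.relpath(subdir, image_dir)
        label = label_by_dir.get(rel_dir)
        if label is None:
            continue  # folder has no entry in the label table, skip it
        for filename in sorted(files):
            if filename.endswith(".jpg"):
                pairs.append((os.path.join(subdir, filename), label))
    return pairs
```

Calling `collect_pairs(image_dir, label_by_dir)` gives a list that can be shuffled and then indexed by position inside `__getitem__`.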

bolt25 · June 17, 2020, 8:59am

I had a similar problem.
This video will definitely help you understand how to write your own custom dataset class

Feel free to ask if you still don’t get it.

Regards!

michelepy · June 17, 2020, 3:01pm

Hi bolt25, thank you so much for your fast reply!
I tried to implement the model, but when I create the DataLoader I get this error: TypeError: join() argument must be str or bytes, not 'int64'.
I suspect the mistake is that I am not attaching the labels to the images properly. I would be so grateful to understand this, thanks for your support!
Here is the code I used:
class MyDataset(Dataset):

    def __init__(self, csv_file, root_dir, transform=None):
        self.labels = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        if torch.is_tensor(index):
            index = index.tolist()

        image_name = os.path.join(self.root_dir, self.labels.iloc[index, 0])
        image = io.imread(image_name)

        if self.transform:
            image = self.transform(image)

        return (image, labels)

ds = MyDataset(csv_file="…/input/flower-goggle-tpu-classification/flowers_idx.csv", root_dir="…/input/flower-goggle-tpu-classification/flower_tpu/flower_tpu/flowers_google/flowers_google", transform=transforms.ToTensor())

This is the error:

TypeError                                 Traceback (most recent call last)
in <module>
----> 1 for image, label in data_dl:
      2     print(image)
      3     print(label)

/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    343
    344     def __next__(self):
--> 345         data = self._next_data()
    346         self._num_yielded += 1
    347         if self._dataset_kind == _DatasetKind.Iterable and \

/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    383     def _next_data(self):
    384         index = self._next_index()  # may raise StopIteration
--> 385         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    386         if self._pin_memory:
    387             data = _utils.pin_memory.pin_memory(data)

/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

in __getitem__(self, index)
     13             index = index.tolist()
     14
---> 15         image_name = os.path.join(self.root_dir, self.labels.iloc[index, 0])
     16         image = io.imread(image_name)
     17

/opt/conda/lib/python3.7/posixpath.py in join(a, *p)
     92                 path += sep + b
     93     except (TypeError, AttributeError, BytesWarning):
---> 94         genericpath._check_arg_types('join', a, *p)
     95         raise
     96     return path

/opt/conda/lib/python3.7/genericpath.py in _check_arg_types(funcname, *args)
    151         else:
    152             raise TypeError('%s() argument must be str or bytes, not %r' %
--> 153                             (funcname, s.__class__.__name__)) from None
    154     if hasstr and hasbytes:
    155         raise TypeError("Can't mix strings and bytes in path components") from None

TypeError: join() argument must be str or bytes, not 'int64'

bolt25 · June 17, 2020, 5:03pm

Can you share a gist of what's in the CSV?
What kind of labels does it have?
Also, I think that once you clear this error, the code will raise one more error on the return line of your __getitem__ function.

michelepy · June 17, 2020, 5:10pm

Yep, this is what's inside the CSV:

I think the problem is in the line image_name = os.path.join(self.root_dir, self.labels.iloc[index, 0]), but I really do not know why.

P.S. Yeah, I had already noticed the error in the return of the __getitem__ function, but thanks!

michelepy · June 17, 2020, 5:13pm

Also, in case it is useful to you, this is the structure of the various folders:
The images are located in the last folder, flowers_google.

bolt25 · June 17, 2020, 5:32pm

Change the following line:
image_name = os.path.join(self.root_dir, self.labels.iloc[index, 0])

To:
image_name = os.path.join(self.root_dir, str(self.labels.iloc[index, 0]))

and your label would be:
labels = self.labels.iloc[index, 1]

I hope this works out for you!

Feel free to text back!
Regards
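
A minimal sketch of why the `str(...)` wrapper helps: `os.path.join` only accepts path-like arguments (`str`, `bytes`, or objects implementing `os.PathLike`), so an integer id read from the CSV is rejected. The example uses a plain Python `int` to stay dependency-free, on the assumption that the `int64` value returned by `iloc` is rejected for the same reason. The names `image_id` and `join_failed` are illustrative only.

```python
import os

root_dir = "flowers_google"  # folder name taken from the thread, used here for illustration
image_id = 7                 # stands in for the integer id read from the CSV

# Passing the integer directly is rejected, because it is not a path-like value
try:
    os.path.join(root_dir, image_id)
    join_failed = False
except TypeError:
    join_failed = True

# Converting the id to a string first gives os.path.join something it accepts
image_name = os.path.join(root_dir, str(image_id))
```

Note that the resulting path still has no file extension, so it only points at a real image if the files on disk are named without one.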

michelepy · June 18, 2020, 5:48am

Hi Dharmik! I just wanted to let you know that thanks to your advice I was able to build the dataset class. Thank you so much, man, I really appreciate it!
In the end I was able to attach each image to its label by its id, and then I only had to add the file extension at the end of each image_name, and the class was able to pair them up!
Have a happy day!
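
The steps described above (look up the id, append the extension, return the label from the second column) could be sketched roughly as follows. This is not the poster's final code, which was not shared. To stay dependency-free it is a plain class that reads the CSV with the standard `csv` module; in the thread's setting it would subclass `torch.utils.data.Dataset` and read the CSV with pandas. The class name `FlowerDataset`, the `loader` parameter, the header row, the column order, and the `.jpg` default are all assumptions.

```python
import csv
import os


class FlowerDataset:
    """Map-style dataset sketch: item i is (image, label) for CSV row i."""

    def __init__(self, csv_file, root_dir, extension=".jpg", loader=None, transform=None):
        with open(csv_file, newline="") as handle:
            reader = csv.reader(handle)
            next(reader)  # assumption: the first row is a header
            # assumption: column 0 holds the image id, column 1 holds the label
            self.rows = [(row[0], row[1]) for row in reader]
        self.root_dir = root_dir
        self.extension = extension  # assumption: the real extension may differ
        self.loader = loader        # hypothetical hook, e.g. an image-reading function
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        image_id, label = self.rows[index]
        # Build the file name from the id plus the extension
        image_name = os.path.join(self.root_dir, str(image_id) + self.extension)
        # Without a loader the path itself is returned, which keeps the sketch runnable
        image = self.loader(image_name) if self.loader else image_name
        if self.transform:
            image = self.transform(image)
        return image, label
```

Because `__getitem__` uses the `index` it receives, any sampling order requested by a data loader maps to the right image and label.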

bolt25 · June 18, 2020, 6:07am

Congrats brother!
Happy to help!
Just make sure you click on the solution box to make my answer the solution to this thread! Maybe it’ll help others!