Each value in my Dataset is weirdly converted to a separate tensor by the DataLoader

Hey everybody, I have these two lists, inputs_ids and labels_ids; each value is a token id representing one token:

batch = 1

inputs_ids:

[[30, 26, 22, 21, 9, 1, 21, 26, 23],
 [30, 26, 22, 0, 9, 14, 32, 2, 4, 0],
 [30, 26, 22, 6, 24, 9, 1, 6, 24, 26, 31],
 [30, 25, 12, 18, 3, 9, 27, 8],
 [5, 12, 28, 20, 9, 15, 14, 28, 11],
 [5, 12, 28, 10, 9, 19, 14, 28, 11],
 [10, 29, 20, 9, 17, 16, 13]]

labels_ids:

[[26, 22, 21, 9, 1, 21, 26, 23, 9],
 [26, 22, 0, 9, 14, 32, 2, 4, 0, 9],
 [26, 22, 6, 24, 9, 1, 6, 24, 26, 31, 9],
 [25, 12, 18, 3, 9, 27, 8, 9],
 [12, 28, 20, 9, 15, 14, 28, 11, 9],
 [12, 28, 10, 9, 19, 14, 28, 11, 9],
 [29, 20, 9, 17, 16, 13, 9]]

I tried to build a dataset class and then a dataloader object:

from typing import List

from torch.utils.data import Dataset, DataLoader

class TokenDataset(Dataset):
  def __init__(self, inputs_ids: List, labels_ids: List) -> None:
    self.inputs_ids = inputs_ids
    self.labels_ids = labels_ids

  def __len__(self):
    return len(self.labels_ids)

  def __getitem__(self, idx):
    input = self.inputs_ids[idx]
    label = self.labels_ids[idx]

    return input, label

dataset_dclass = TokenDataset(inputs_ids, labels_ids)
dataloader_dclass = DataLoader(dataset=dataset_dclass, batch_size=batch)

The only problem is that every value in each input and label gets converted to its own tensor! I wonder why, and I can't understand it:

dataset_dclass[0]:

  • Output:

([30, 26, 22, 21, 9, 1, 21, 26, 23], [26, 22, 21, 9, 1, 21, 26, 23, 9])

next(iter(dataloader_dclass))

  • Output:
[[tensor([30]),tensor([26]),tensor([22]),tensor([21]),tensor([9]),tensor([1]),tensor([21]),tensor([26]),
  tensor([23])],
 [tensor([26]),tensor([22]),tensor([21]),tensor([9]),tensor([1]),tensor([21]),tensor([26]),tensor([23]),tensor([9])]]
  • What I expected instead as the output of next(iter(dataloader_dclass)):

[ tensor([30, 26, 22, 21, 9, 1, 21, 26, 23]) , tensor([26, 22, 21, 9, 1, 21, 26, 23, 9]) ]

It turns out I had forgotten to convert the values into tensors myself, and the conversion happened automatically inside the DataLoader: its default collate_fn recurses into each (input, label) sample, zips the lists element-wise, and wraps every token id in its own length-1 tensor, which produces exactly the nested output above.
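To see that mechanism in isolation, here is a minimal sketch using default_collate, the function DataLoader falls back to when no collate_fn is given (importable from torch.utils.data in recent PyTorch versions):

```python
import torch
from torch.utils.data import default_collate

# One sample: an (input, label) pair of plain Python lists.
sample = ([30, 26, 22], [26, 22, 9])

# The default collate transposes the batch, then recurses into the
# lists element-wise, so each token id ends up as a length-1 tensor.
batched = default_collate([sample])
print(batched)
# [[tensor([30]), tensor([26]), tensor([22])],
#  [tensor([26]), tensor([22]), tensor([9])]]
```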

To fix it, I now convert each item to a tensor in __getitem__ and pass collate_fn=lambda x: x[0] to the DataLoader to drop the extra batch dimension (fine here, since batch_size is 1).

from typing import List

import torch
from torch.utils.data import Dataset, DataLoader

class TokenDataset(Dataset):
  def __init__(self, inputs_ids: List, labels_ids: List) -> None:
    self.inputs_ids = inputs_ids
    self.labels_ids = labels_ids

  def __len__(self):
    return len(self.labels_ids)

  def __getitem__(self, idx):
    input = torch.tensor(self.inputs_ids[idx])
    label = torch.tensor(self.labels_ids[idx])

    return input, label

dataset_dclass = TokenDataset(inputs_ids, labels_ids)
dataloader_dclass = DataLoader(dataset=dataset_dclass, batch_size=batch, collate_fn=lambda x: x[0])

next(iter(dataloader_dclass)):

  • output:
(tensor([30, 26, 22, 21,  9,  1, 21, 26, 23]),
tensor([26, 22, 21,  9,  1, 21, 26, 23,  9]))
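
The lambda works for batch_size 1, but the variable-length rows above cannot be stacked directly for a larger batch. One option, sketched here with pad_sequence and an assumed padding value of 0 (note that 0 is already a real token id in the lists above, so a dedicated pad id would be safer in practice), is a padding collate function:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    # batch is a list of (input, label) pairs of 1-D tensors.
    inputs, labels = zip(*batch)
    # Pad every sequence to the longest one in the batch.
    # padding_value=0 is an assumption -- 0 is a real token id in the
    # example vocabulary, so a reserved pad id is preferable.
    inputs = pad_sequence(inputs, batch_first=True, padding_value=0)
    labels = pad_sequence(labels, batch_first=True, padding_value=0)
    return inputs, labels

# Usage: DataLoader(dataset_dclass, batch_size=2, collate_fn=pad_collate)
```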