DataLoader Not Iterating Correctly

I have a pandas dataframe containing tokenized input sequences and labels in list format.

However, I’m having a hard time creating a CustomDataset and DataLoader to convert these lists to torch tensors and batch them together. I’ve created a dummy pandas dataframe for demonstration. Here is my code:

# Dummy dataframe
import random

import pandas as pd
import torch
data = {
    "text_input_ids": [[random.randint(1, 100) for _ in range(10)] for _ in range(100)],  # Lists of length 10
    "d_icd_desc_code_label": [[random.randint(0, 1) for _ in range(15)] for _ in range(100)],  # Lists of length 15
}

# Create the DataFrame
test_df = pd.DataFrame(data)

class CustomDataset(Dataset):
    def __init__(self, dataframe):
        assert isinstance(dataframe, pd.DataFrame), 'dataframe must be of type pd.DataFrame'
        self.dataframe = dataframe

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        input_tokens = self.dataframe.iloc[idx]['text_input_ids'] 
        label_tokens = self.dataframe.iloc[idx]['d_icd_desc_code_label']
        return torch.tensor(input_tokens), torch.tensor(label_tokens)

train_ds = CustomDataset(dataframe=test_df)

Checking one element in the Dataset object:

train_ds[0]

returns:

(tensor([15, 56, 23,  8, 45, 23, 36, 20, 86, 21]),
 tensor([0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1]))

However, attempting:

train_loader = torch.utils.data.DataLoader(train_ds, batch_size=2, shuffle=True)
next(iter(train_loader))

raises:

Cell In[803], line 12, in CustomDataset.__getitem__(self, idx)
     10 input_tokens = self.dataframe.iloc[idx]['text_input_ids'] 
     11 label_tokens = self.dataframe.iloc[idx]['d_icd_desc_code_label']
---> 12 return torch.tensor(input_tokens), torch.tensor(label_tokens)

ValueError: could not determine the shape of object type 'Series'

and attempting:

train_loader = torch.utils.data.DataLoader(train_ds, batch_size=2, shuffle=False)
next(iter(train_loader))

raises:

File c:\Users\SPADMINKH\Documents\Repo\Auto Coding MIMIC Prototype\venv-autocoding\Lib\site-packages\datasets\arrow_dataset.py:2808, in Dataset.__getitems__(self, keys)
   2806 """Can be used to get a batch using a list of integers indices."""
   2807 batch = self.__getitem__(keys)
-> 2808 n_examples = len(batch[next(iter(batch))])
   2809 return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]

TypeError: only integer tensors of a single element can be converted to an index

I’ve been at this for hours! Can someone help me identify my error here?

I am really, and I mean REALLY dense.

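There was no error in the code I posted. The problem was the import: I was importing Dataset from the Hugging Face datasets library instead of from torch.utils.data, so CustomDataset was inheriting from the wrong Dataset class. That is also why the second traceback points into datasets\arrow_dataset.py rather than into PyTorch.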

Good grief! Don’t be like me!
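
For anyone who runs into the same thing, one quick way to check which Dataset a class really inherits from is to look at the modules of its base classes. This is only a small sketch: base_class_origins is a hypothetical helper name of mine, not part of PyTorch or Hugging Face datasets.

```python
def base_class_origins(cls):
    """Return 'module.QualifiedName' for each base class in cls's MRO.

    The class itself (first MRO entry) and `object` (last entry) are skipped.
    """
    return [f"{base.__module__}.{base.__qualname__}" for base in cls.__mro__[1:-1]]


# Hypothetical usage with the class from this thread:
#     print(base_class_origins(CustomDataset))
# If the first entry lives under `datasets.` rather than `torch.utils.data.`,
# the wrong Dataset was imported.
```

Printing Dataset.__module__ right after the import gives the same information for the imported name itself.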

That’s interesting to hear, as I also ran your code locally before reading your update, and it works fine with the Dataset class from PyTorch.
The error messages you posted are also quite confusing.
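
For reference, here is a minimal sketch of the version that runs for me. It assumes the only change needed is the explicit import from torch.utils.data; the single iloc lookup per item is a small simplification of mine, not something the original code needs.

```python
import random

import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset  # PyTorch's Dataset, not datasets.Dataset


class CustomDataset(Dataset):
    def __init__(self, dataframe):
        assert isinstance(dataframe, pd.DataFrame), "dataframe must be of type pd.DataFrame"
        self.dataframe = dataframe

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        # idx is a single integer here, so each cell is a plain Python list
        row = self.dataframe.iloc[idx]
        return (
            torch.tensor(row["text_input_ids"]),
            torch.tensor(row["d_icd_desc_code_label"]),
        )


# Same dummy data as in the question
test_df = pd.DataFrame(
    {
        "text_input_ids": [[random.randint(1, 100) for _ in range(10)] for _ in range(100)],
        "d_icd_desc_code_label": [[random.randint(0, 1) for _ in range(15)] for _ in range(100)],
    }
)

train_loader = DataLoader(CustomDataset(test_df), batch_size=2, shuffle=True)
input_ids, labels = next(iter(train_loader))
print(input_ids.shape, labels.shape)  # torch.Size([2, 10]) torch.Size([2, 15])
```

The default collate function stacks the per-item tensors along a new batch dimension, which is where the leading 2 in both shapes comes from.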