Loading a CSV with a column of strings and a column of integers

Hello! I’m new to PyTorch. I have a CSV file with a column of text and a column of integer labels. How can I load this CSV into a DataLoader so that I can train a classification model?

This is my code, which raises an exception:

import torch
import pandas as pd

train = pd.read_csv("/content/Q_V_1.08.csv")
train_tensor = torch.tensor(train.values)

TypeError                                 Traceback (most recent call last)
<ipython-input-44-83fe164cf357> in <module>()
      4 train = pd.read_csv("/content/Q_V_1.08.csv")
----> 5 train_tensor = torch.tensor(train.values)

TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

Hi! The most direct way would probably be to create a custom Dataset for your files. It’s quite straightforward: you subclass torch.utils.data.Dataset and define the __len__ (in your case, just return len(train)) and __getitem__ methods, as described in the tutorial. Good luck!
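
For context on the TypeError itself: a DataFrame that mixes a text column and an integer column is exposed by .values as a NumPy array of dtype object, which torch.tensor cannot convert. The numeric column on its own converts fine. A minimal sketch, assuming made-up column names "text" and "label" and an in-memory frame instead of the real CSV:

```python
import pandas as pd
import torch

# Hypothetical stand-in for the CSV: one text column, one integer column.
df = pd.DataFrame({"text": ["alpha", "bravo", "charlie"], "label": [100, 200, 300]})

# Mixed columns collapse to a single object-dtype array, which torch rejects.
raised = False
try:
    torch.tensor(df.values)
except TypeError:
    raised = True

# The integer column alone has a numeric dtype and converts without trouble.
labels = torch.tensor(df["label"].values)
```

The text column still needs to be encoded as numbers before it can become a tensor.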

In the tutorial, the __getitem__ method converts an image path to a tensor using read_image. The text in my CSV file is not an image path. How would I convert it to a tensor?

Also, after creating the custom dataset, how would I pass it into the dataloader and split the dataset into training and test sets?

Regarding __getitem__, you can customize it to return whatever it is you want to use in your training loop. For example, in your case you may try something like this:

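import pandas as pd
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, csv_file):
        # header=None treats the first row as data; column labels become 0 and 1
        self.data = pd.read_csv(csv_file, header=None)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        # Column 0 holds the text, column 1 the integer label
        sample = {'text': row[0], 'number': row[1]}
        return sample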

dataset = CustomDataset("/content/Q_V_1.08.csv")
for foo in dataset:
    print(foo["text"], foo["number"])

alpha 100
bravo 200
charlie 300
delta 400
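
The samples above still carry raw strings, so to get a tensor out of the text you need to map it to integers first. Here is a minimal sketch, assuming plain whitespace tokenization and a vocabulary built from the data itself. The names build_vocab, encode, and max_len are hypothetical helpers for illustration, and a real project would probably use a proper tokenizer instead:

```python
import torch

def build_vocab(texts):
    # Assign an integer id to every distinct whitespace-separated token.
    # Id 0 is reserved for padding and id 1 for tokens not seen here.
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for token in str(text).lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(text, vocab, max_len=8):
    # Look up each token, truncate to max_len, then pad to a fixed length
    # so that samples can later be stacked into a batch.
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in str(text).lower().split()]
    ids = ids[:max_len]
    ids = ids + [vocab["<pad>"]] * (max_len - len(ids))
    return torch.tensor(ids, dtype=torch.long)

texts = ["alpha", "bravo", "charlie", "delta"]
vocab = build_vocab(texts)

x = encode("bravo delta", vocab, max_len=4)   # token ids, padded to length 4
y = torch.tensor(200, dtype=torch.long)       # the integer label as a tensor
```

You could call encode inside __getitem__ and return the two tensors instead of the raw row values. The ids would then typically go into an embedding layer in the model.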

Regarding the train/test split, you might do this, which I found by Googling “pytorch train test split”.
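
One option that ships with PyTorch is torch.utils.data.random_split, and the resulting subsets can be passed straight to a DataLoader. The following is only a sketch: ToyDataset is a hypothetical in-memory stand-in for the CSV-backed dataset, and the 8/2 split and batch size of 4 are arbitrary choices:

```python
import torch
from torch.utils.data import Dataset, DataLoader, random_split

class ToyDataset(Dataset):
    # Hypothetical stand-in that returns the same dict layout as CustomDataset.
    def __init__(self, texts, numbers):
        self.texts = texts
        self.numbers = numbers

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return {"text": self.texts[idx], "number": self.numbers[idx]}

dataset = ToyDataset([f"sample {i}" for i in range(10)], list(range(10)))

# Split into 8 training and 2 test samples; the seeded generator makes
# the split reproducible between runs.
generator = torch.Generator().manual_seed(0)
train_set, test_set = random_split(dataset, [8, 2], generator=generator)

train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
test_loader = DataLoader(test_set, batch_size=4, shuffle=False)

# The default collate function turns the ints into one tensor per batch
# and leaves the strings as a list.
batch = next(iter(train_loader))
```

Since the text is still a list of strings in each batch, you would encode it either in __getitem__ or in a custom collate_fn before feeding it to a model.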