Where should I implement the preprocessing code for text? (e.g. seq2seq learning)

I want to preprocess the text data, for example converting each word to an index and adding padding (for seq2seq learning). Is the approach below a good way to handle this?

    import os
    import torch

    class MyDataset(torch.utils.data.Dataset):
        def __init__(self):
            self.data_files = os.listdir('data_dir')
            self.data_files.sort()            # ensure a deterministic file order

        def __getitem__(self, idx):
            data = load_file(self.data_files[idx])
            data = preprocess_data(data)      # preprocess on the fly
            return data

        def __len__(self):
            return len(self.data_files)

    dset = MyDataset()
    loader = torch.utils.data.DataLoader(dset, num_workers=8)
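
For concreteness, my preprocess_data would do something roughly like the sketch below; the vocabulary, pad/unk tokens, and max_len here are just placeholders:

    import torch

    PAD_IDX = 0
    UNK_IDX = 1
    word2idx = {'<pad>': PAD_IDX, '<unk>': UNK_IDX}  # built from the corpus beforehand (hypothetical)

    def preprocess_data(text, max_len=50):
        # map each word to its index, falling back to <unk> for unknown words
        indices = [word2idx.get(w, UNK_IDX) for w in text.split()]
        # truncate, then pad with <pad> up to max_len
        indices = indices[:max_len]
        indices += [PAD_IDX] * (max_len - len(indices))
        return torch.tensor(indices, dtype=torch.long)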

Yes, this is a good way to handle text loading.

You can also look at the torchtext package for more complex examples.
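
If you'd rather not pad to a fixed length inside the dataset, another common pattern is to pad per batch with a collate_fn. A minimal sketch, assuming __getitem__ returns a variable-length 1-D LongTensor per example:

    import torch
    from torch.nn.utils.rnn import pad_sequence

    def collate_batch(batch):
        # batch is a list of 1-D LongTensors from __getitem__;
        # pad them all to the length of the longest sequence in the batch
        return pad_sequence(batch, batch_first=True, padding_value=0)

    loader = torch.utils.data.DataLoader(dset, num_workers=8, batch_size=32,
                                         collate_fn=collate_batch)

This way each batch is only as wide as its longest sequence, which avoids wasting computation on padding to a global maximum length.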

The repo looks really good. Is there a timeline for when it might be released on pip?