Newbie question: loading a custom lexicon into a torchtext dataset (no legacy)

It’s probably very simple, but I just want to play around with RNN/LSTM/etc… on a text→regression problem. My dataset is just sequences of the characters a, t, c, g, with no spaces and of variable length. I’m confused by all the examples out there that use the legacy torchtext API, spaCy, and the default tokenizers (e.g. built for an English corpus). How should I adapt them for this kind of dataset?
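From what I can tell, the non-legacy replacement for the old Field/vocab setup is torchtext.vocab.build_vocab_from_iterator, so for a four-letter alphabet I imagine building a character-level vocab would look roughly like this (the char_tokens helper and the '<pad>' special are my own choices, and the exact indices depend on character frequencies):

from torchtext.vocab import build_vocab_from_iterator

# yield each sequence as a list of single characters
def char_tokens(seqs):
    for s in seqs:
        yield list(s.lower())

# '<pad>' is listed as a special, so it gets index 0 by default
vocab = build_vocab_from_iterator(char_tokens(['atcg', 'ttgac']),
                                  specials=['<pad>'])
vocab.set_default_index(vocab['<pad>'])

vocab(list('gatc'))   # -> list of integer indices, one per character

Is that the right idea, or is torchtext overkill when the whole alphabet is four characters?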

This is the Dataset I have so far. The dataset is big (>1 million rows), so I would like to do it in a sensible way: first without packing (to better understand the PyTorch API), then with packing (since I’ve read that is the proper way).

import numpy as np
import pandas as pd
from torch.utils.data import Dataset


# simple pandas DataFrame with two columns: sequence and target value
input_data = pd.read_csv('./data/GSE135464_means_nextseq_GPD.csv')

class atcgDataset(Dataset):
    def __init__(self, table):
        # raw sequence strings and float32 regression targets
        self.seq = np.array(table['Seqs'])
        self.values = np.array(table['values'], dtype=np.float32)
    def __len__(self):
        return len(self.seq)
    def __getitem__(self, index):
        # returns (target, raw sequence string); encoding happens later
        return self.values[index], self.seq[index]
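Since the sequences are variable length, I assume I also need a collate_fn that turns each string into a tensor of character indices and pads the batch (no packing yet). Something like this, where the VOCAB dict, encode, and collate_batch are just names I made up:

import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# hand-written character vocabulary; index 0 is reserved for padding
VOCAB = {'<pad>': 0, 'a': 1, 't': 2, 'c': 3, 'g': 4}

def encode(seq):
    # map each character to its integer index
    return torch.tensor([VOCAB[ch] for ch in seq.lower()], dtype=torch.long)

def collate_batch(batch):
    # batch is a list of (value, seq) pairs from atcgDataset.__getitem__
    values, seqs = zip(*batch)
    encoded = [encode(s) for s in seqs]
    lengths = torch.tensor([len(e) for e in encoded], dtype=torch.long)
    # pad every sequence in the batch to the length of the longest one
    padded = pad_sequence(encoded, batch_first=True, padding_value=VOCAB['<pad>'])
    targets = torch.tensor(values, dtype=torch.float32)
    return padded, lengths, targets

loader = DataLoader(atcgDataset(input_data), batch_size=64,
                    shuffle=True, collate_fn=collate_batch)

The lengths tensor is only there because I expect to need it for packing later.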

Update: I hacked the tutorial here to get it to work:

https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

But I am still curious about other ways to do it. If you know of any meaningfully better ways, please tell! :slight_smile:
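For what it’s worth, this is how I currently picture the packed version slotting in, with pack_padded_sequence inside the model’s forward and using the loader from the collate sketch above; the SeqRegressor model and its sizes are just placeholders, so corrections welcome:

import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class SeqRegressor(nn.Module):
    def __init__(self, vocab_size=5, embed_dim=16, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, padded, lengths):
        embedded = self.embedding(padded)                    # (batch, seq, embed)
        # packing lets the LSTM skip the padded positions entirely
        packed = pack_padded_sequence(embedded, lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)                      # h_n: (1, batch, hidden)
        return self.head(h_n[-1]).squeeze(-1)                # (batch,)

model = SeqRegressor()
loss_fn = nn.MSELoss()
for padded, lengths, targets in loader:
    loss = loss_fn(model(padded, lengths), targets)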