It’s probably very simple, but I just want to play around with RNNs/LSTMs/etc. on this text-to-regression problem. My dataset is just strings of the characters a, t, c, g without spaces, of variable length. I’m confused by all the examples out there that use the legacy torchtext, spaCy, and default tokenizers (e.g. for an English corpus). How should I adapt this dataset?
This is what I have so far. It’s a big dataset (over 1 million examples), so I would like to do it in a sensible way: first without packing (to better understand the PyTorch API), then with packing (which, from what I have seen, is the proper way).
import numpy as np
import pandas as pd
from torch.utils.data import Dataset

# Simple pandas DataFrame with 2 columns: sequence and target
input_data = pd.read_csv('./data/GSE135464_means_nextseq_GPD.csv')

class atcgDataset(Dataset):
    def __init__(self, table):
        self.seq = np.array(table['Seqs'])  # raw a/t/c/g strings
        self.values = np.array(table['values'], dtype=np.float32)  # regression targets

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, index):
        # Returns (target, sequence), in that order
        return self.values[index], self.seq[index]
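For the version without packing, here is a minimal sketch of how the raw strings could be turned into padded index tensors, so no tokenizer library is needed. It assumes the sequences contain only the characters a, t, c, g (any other character, such as n, would raise a KeyError). The names `VOCAB`, `encode`, and `collate_batch` are hypothetical, and reserving index 0 for padding is my own convention.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical vocabulary; index 0 is reserved for padding (an assumption).
VOCAB = {"a": 1, "t": 2, "c": 3, "g": 4}

def encode(seq):
    """Map a string such as 'atcg' to a 1-D LongTensor of character indices."""
    return torch.tensor([VOCAB[ch] for ch in seq.lower()], dtype=torch.long)

def collate_batch(batch):
    """Collate (value, sequence) pairs, the order __getitem__ returns them in."""
    values, seqs = zip(*batch)
    encoded = [encode(s) for s in seqs]
    # True lengths are kept so the padding can be ignored or packed later.
    lengths = torch.tensor([len(e) for e in encoded], dtype=torch.long)
    # Pad every sequence in the batch to the longest one, using index 0.
    padded = pad_sequence(encoded, batch_first=True, padding_value=0)
    targets = torch.tensor([float(v) for v in values], dtype=torch.float32)
    return padded, lengths, targets

# Tiny hand-made batch to show the shapes.
padded, lengths, targets = collate_batch([(0.5, "atcg"), (1.5, "gg")])
print(padded.shape, lengths, targets)
```

Something like `DataLoader(atcgDataset(input_data), batch_size=64, shuffle=True, collate_fn=collate_batch)` should then yield batches in this form. Encoding inside the collate function keeps memory use low for a dataset this size, at the cost of re-encoding each epoch.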
Update: I hacked the tutorial here to get it to work:
https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html
But I am still curious about other ways to do it. If you know of any meaningfully better ways, please tell!
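For the packed version mentioned above, this is a minimal sketch of what I think it could look like, not something taken from the tutorial. It assumes the input is a batch of padded index tensors (0 as the padding index, 1 to 4 for a/t/c/g) plus a tensor of true lengths. The class name `PackedLSTMRegressor` and all layer sizes are hypothetical.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class PackedLSTMRegressor(nn.Module):
    """Embedding -> LSTM over packed sequences -> linear head (one value per sequence)."""

    def __init__(self, vocab_size=5, embed_dim=8, hidden_dim=32):
        super().__init__()
        # padding_idx=0 keeps the padding embedding fixed at zero.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, padded, lengths):
        embedded = self.embed(padded)  # (batch, max_len, embed_dim)
        # Packing makes the LSTM skip padded positions; lengths must live on the CPU.
        packed = pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        # h_n[-1] is the last layer's hidden state at each sequence's true final step.
        return self.head(h_n[-1]).squeeze(-1)  # (batch,)

model = PackedLSTMRegressor()
padded = torch.tensor([[1, 2, 3, 4], [4, 4, 0, 0]])  # "atcg" and "gg" padded
lengths = torch.tensor([4, 2])
preds = model(padded, lengths)
print(preds.shape)
```

With `enforce_sorted=False` the batch does not have to be sorted by length beforehand. The predictions can be trained against the float targets with `nn.MSELoss()`.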