Can't overwrite len() in Dataset

ebelle · August 18, 2020, 5:01pm

I’ve created a LazyDataset, but I’d like to skip the first line of the file, so I’m shortening the len() by 1 (and then adding 1 to the index in __getitem__()). However, I can’t override __len__(). I’ve even tried:

def __len__(self):
    return 12

and it’s not working. Full code:

import linecache

from torch.utils.data import Dataset  # assumption: the post does not show its imports


class LazyDataset(Dataset):

    def __init__(self, filepath, source_vocab, target_vocab, task):
        self.source_vocab = source_vocab
        self.text_init = self.source_vocab.init_token
        self.text_eos = self.source_vocab.eos_token
        self.target_vocab = target_vocab
        self.target_init = self.target_vocab.init_token
        self.target_eos = self.target_vocab.eos_token
        self.filepath = filepath
        # get total file length (header line included)
        with open(self.filepath, "r") as f:
            self._total_data = sum(1 for _ in f)
        self.task = task

    def __len__(self):
        "Denotes the total number of samples"
        # subtract 1 from the file len to skip header in getitem
        return self._total_data - 1

    def tokens_to_idx(self, text, target):
        # TODO: add arguments to make init & eos optional
        # add init and eos tokens
        if self.task == "translation":
            text = [self.text_init] + text + [self.text_eos]
            target = [self.target_init] + target + [self.target_eos]
        # tokens to indices
        text = [self.source_vocab.vocab.stoi[t] for t in text]
        target = [self.target_vocab.vocab.stoi[t] for t in target]

        return text, target

    def __getitem__(self, index):
        "Generates one sample of data"
        # normally you need +1 since linecache indexes from 1
        # here, we skip the header by adding +2 instead of +1
        print(index)  # debug: shows which indices the loader requests
        line = linecache.getline(self.filepath, index + 2)
        text, target = line.split("\t")
        # string to list, tokenizing on white space
        text, target = text.split(), target.split()
        text, target = self.tokens_to_idx(text, target)
        text_lens = len(text)
        return text, target, text_lens

user_123454321 · August 18, 2020, 5:15pm

What is the error you get? And shouldn’t __len__ return the value like this…

def __len__(self):
    "Denotes the total number of samples"
    # subtract 1 from the file len to skip header in getitem
    return self._total_data - 1

ebelle · August 18, 2020, 5:23pm

Yeah, sorry I edited that last minute. I had it as return 12 so I quickly edited it and missed the return. I’ll fix it now.

The error is that it’s pulling from an empty line (the line beyond the last line of text in the file). I want the max index that goes in to be N-1, but the max index it’s getting is still N. Even if I set it to “return 12”, it’s still returning 23,037. If I override it to “return 12”, it should return 12, no?
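
For anyone following along, here is a minimal sketch of why an index that is one too large shows up as an "empty line" rather than an IndexError. linecache.getline returns an empty string for a line number past the end of the file, so the failure only appears later, when the tab split is unpacked. The temporary file, its contents, and the get_line helper are made up for illustration and are not from the code above.

```python
import linecache
import os
import tempfile

# Hypothetical three-line TSV: one header plus two data rows.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as f:
    f.write("source\ttarget\nhello world\thallo welt\ngood day\tguten tag\n")
    path = f.name

with open(path) as f:
    total_lines = sum(1 for _ in f)  # counts the header too
n_samples = total_lines - 1          # what __len__ is meant to report


def get_line(index):
    # linecache counts lines from 1, so index + 1 would map index 0 to the
    # header; index + 2 maps index 0 to the first data row instead.
    return linecache.getline(path, index + 2)


first = get_line(0)              # first data row
last = get_line(n_samples - 1)   # last data row
past_end = get_line(n_samples)   # one index too far: empty string, no exception

# The empty string only fails once the two-way unpack is attempted.
try:
    text, target = past_end.split("\t")
    unpack_error = None
except ValueError as err:
    unpack_error = err

linecache.clearcache()
os.remove(path)
```

So if the largest index requested is N instead of N-1, the symptom is exactly a read of the line beyond the end of the file.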

user_123454321 · August 18, 2020, 5:31pm

It returns 23,037 even if you return 12? Weird. Can you show how you are instantiating the dataset and getting the length?

ebelle · August 18, 2020, 5:35pm

I figured it out! It’s the batch loader I’m using. Thank you so much for your time. I’m sorry for asking a question where you wouldn’t even be able to see the answer. Have a great day!!!
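
The thread does not say which batch loader was involved or what exactly it did, so the following is only a guess at the general mechanism, written as a plain-Python sketch. It shows how a loader that gets its item count from somewhere other than len(dataset) will keep requesting the old index range no matter what __len__ returns. TinyDataset, batches_from_len, and batches_from_count are all hypothetical names invented for this illustration.

```python
class TinyDataset:
    """Hypothetical stand-in for the LazyDataset in this thread."""

    def __init__(self, rows):
        self.rows = rows  # rows[0] plays the role of the header line

    def __len__(self):
        # One less than the number of rows, to skip the header.
        return len(self.rows) - 1

    def __getitem__(self, index):
        # Shift by one so index 0 is the first data row.
        return self.rows[index + 1]


def batches_from_len(dataset, batch_size):
    # Respects the override: the index range comes from len(dataset).
    indices = list(range(len(dataset)))
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


def batches_from_count(dataset, n_items, batch_size):
    # Ignores the override: the index range comes from a separate count,
    # for example one computed from the raw file before the header was skipped.
    indices = list(range(n_items))
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


rows = ["header", "a", "b", "c"]
dataset = TinyDataset(rows)

good = list(batches_from_len(dataset, batch_size=2))

# Passing the raw row count asks for one index too many.
try:
    list(batches_from_count(dataset, n_items=len(rows), batch_size=2))
    stale_count_error = None
except IndexError as err:
    stale_count_error = err
```

If something like this is going on, the fix belongs in the loader or sampler configuration rather than in the dataset's __len__.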
