Not able to append multiple batches to single output whilst tokenizing

seanfarr788 · January 18, 2022, 5:10pm

I am trying to iterate over my dataset in batches. Say my dataset is 5000 lines long: I divide it into blocks of 1000 lines and pass each block through a loop, where I tokenize the block and then append the result to the original ‘inputs’.

The loop then goes back to the top and gets the next block. However, in doing so it overwrites the previous block, so the end result is a tensor of length 1000 (containing only the last block of 1000 lines).

def batchReader(Dataset, block_size=1000):
    block = []
    for line in Dataset:
        block.append(line)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block

input_ids = []
attention_masks = []
token_type_ids = []
count = 0
with open('5000.txt') as Dataset:
    blocks = batchReader(Dataset)
    for block in blocks:
        inputs = tokenizer(
            block,
            truncation=True,
            max_length=512,
            padding='max_length',
            return_tensors='pt',
        )
        input_ids.append(inputs['input_ids'])
        attention_masks.append(inputs['attention_mask'])
        token_type_ids.append(inputs['token_type_ids'])
        inputs['labels'] = inputs.input_ids.detach().clone()

I would like this code to produce a tensor of length 5000. Any help is much appreciated!
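
One possible direction, as a minimal sketch rather than a definitive fix: in the loop above, `inputs` is reassigned on every iteration, so it only ever holds the last block, while the lists `input_ids`, `attention_masks` and `token_type_ids` do collect every block but are never joined. Concatenating the per-block tensors with `torch.cat` after the loop should give one tensor covering all lines. The names `batch_reader` and `tokenize_in_blocks` are hypothetical, and `tokenizer` is assumed to be a Hugging Face tokenizer (or any callable) that returns a dict-like object of PyTorch tensors.

```python
import torch


def batch_reader(lines, block_size=1000):
    """Yield successive lists of up to block_size lines."""
    block = []
    for line in lines:
        block.append(line.rstrip("\n"))  # drop the trailing newline from file lines
        if len(block) == block_size:
            yield block
            block = []
    if block:  # final, possibly shorter block
        yield block


def tokenize_in_blocks(lines, tokenizer, block_size=1000, max_length=512):
    """Tokenize block by block, then join the per-block tensors along dim 0.

    `tokenizer` is an assumption: any callable that accepts these keyword
    arguments and returns a dict-like mapping of names to tensors.
    """
    collected = {}  # key -> list of per-block tensors
    for block in batch_reader(lines, block_size):
        encoded = tokenizer(
            block,
            truncation=True,
            max_length=max_length,
            padding="max_length",
            return_tensors="pt",
        )
        for key, tensor in encoded.items():
            collected.setdefault(key, []).append(tensor)

    if not collected:
        raise ValueError("no lines to tokenize")

    # padding="max_length" gives every block the shape (n_lines, max_length),
    # so the blocks can be concatenated along the first dimension.
    inputs = {key: torch.cat(parts, dim=0) for key, parts in collected.items()}
    inputs["labels"] = inputs["input_ids"].detach().clone()
    return inputs
```

With a real tokenizer this would be called inside the `with open('5000.txt') as Dataset:` block as `inputs = tokenize_in_blocks(Dataset, tokenizer)`. Creating `labels` once, after concatenation, avoids attaching them to a per-block object that is discarded on the next iteration.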