SentencePiece Tokenizer

I’m trying to understand how to properly use the generate_sp_model output as a tokenizer.

A simplified coding example is as follows:

import torch
import io
import csv
from torchtext.data.functional import generate_sp_model, load_sp_model, sentencepiece_tokenizer, sentencepiece_numericalizer
from collections import Counter
from torchtext.vocab import Vocab

list_a = ["sentencepiece encode as pieces", "examples to try!"]

with open('sample.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # one sentence per row (writerows on the bare strings would split each one into characters)
    writer.writerows([[sentence] for sentence in list_a])

generate_sp_model('sample.csv', vocab_size=21, model_prefix='sample')
vocab_tokenizer = load_sp_model('sample.model')  # spmodel is a tokenizer
sp_tokens_generator = sentencepiece_tokenizer(sp_model=vocab_tokenizer)
sp_id_generator = sentencepiece_numericalizer(vocab_tokenizer)

print(list(sp_id_generator(list_a)))
print(list(sp_tokens_generator(list_a)))

def build_vocab(filepath, tokenizer):
    counter = Counter()
    with io.open(filepath, encoding="utf8") as f:
        for string_ in f:
            counter.update(tokenizer(string_))
    return Vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

en_vocab=build_vocab('sample.csv', vocab_tokenizer)

The printouts are fine. The loaded model just doesn’t seem to like being called directly as a tokenizer inside build_vocab. The error given is:

Traceback (most recent call last):
  File "scratches/scratch_42.py", line 33, in <module>
    en_vocab=build_vocab('sample.csv', vocab_tokenizer)
  File "scratches/scratch_42.py", line 30, in build_vocab
    counter.update(tokenizer(string_))
TypeError: 'torch._C.ScriptObject' object is not callable

Does that error basically mean I shouldn’t be using the model produced by generate_sp_model directly as a tokenizer?

https://pytorch.org/text/stable/data_functional.html#torchtext.data.functional.generate_sp_model
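
Reading the docs again, it looks like load_sp_model returns the SentencePiece model itself (hence the torch ScriptObject in the traceback), while the actual callables are the generator functions that sentencepiece_tokenizer and sentencepiece_numericalizer build around it. Roughly, reusing the names from the snippet above:

# vocab_tokenizer is the raw model object returned by load_sp_model;
# calling vocab_tokenizer(...) directly raises the TypeError above.
# sp_tokens_generator is the callable wrapper built by sentencepiece_tokenizer:
print(list(sp_tokens_generator(["sentencepiece encode as pieces"])))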

Okay, figured it out. I changed

counter.update(tokenizer(string_))

to:

for elem in list(tokenizer([string_])):
    counter.update(elem)

And also:

en_vocab=build_vocab('sample.csv', vocab_tokenizer)

to:

en_vocab=build_vocab('sample.csv', sp_tokens_generator)
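
Putting the two changes together: the generator returned by sentencepiece_tokenizer expects an iterable of sentences and yields one list of subword tokens per sentence, so the Counter has to be updated with each yielded list rather than with the raw model object. The working build_vocab looks roughly like this (same imports and names as above):

def build_vocab(filepath, tokenizer):
    counter = Counter()
    with io.open(filepath, encoding="utf8") as f:
        for string_ in f:
            # the generator from sentencepiece_tokenizer takes an iterable of
            # sentences and yields a list of subword tokens per sentence
            for tokens in tokenizer([string_]):
                counter.update(tokens)
    return Vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

en_vocab = build_vocab('sample.csv', sp_tokens_generator)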

That allowed me to save the vocab object with:

torch.save(en_vocab, 'sample_vocab')
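
and load it back later with torch.load. The (legacy, pre-0.10) torchtext Vocab object exposes stoi / itos for the token-to-index mapping, so something like this should work; newer torchtext releases may differ:

en_vocab = torch.load('sample_vocab')   # restores the pickled Vocab object
print(len(en_vocab))                    # vocab size, including the specials
print(en_vocab.stoi['<unk>'])           # token -> index
print(en_vocab.itos[:10])               # index -> token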