I’m trying to understand how to properly use the generate_sp_model output as a tokenizer.
A simplified coding example is as follows:
import torch
import io
import csv
from torchtext.data.functional import generate_sp_model, load_sp_model, sentencepiece_tokenizer, sentencepiece_numericalizer
from collections import Counter
from torchtext.vocab import Vocab
list_a = ["sentencepiece encode as pieces", "examples to try!"]
with open('sample.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(list_a)
generate_sp_model('sample.csv', vocab_size=21, model_prefix='sample')
vocab_tokenizer = load_sp_model('sample.model')  # spmodel is a tokenizer
sp_tokens_generator = sentencepiece_tokenizer(sp_model=vocab_tokenizer)
sp_id_generator = sentencepiece_numericalizer(vocab_tokenizer)
print(list(sp_id_generator(list_a)))
print(list(sp_tokens_generator(list_a)))
def build_vocab(filepath, tokenizer):
    counter = Counter()
    with io.open(filepath, encoding="utf8") as f:
        for string_ in f:
            counter.update(tokenizer(string_))
    return Vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])
en_vocab=build_vocab('sample.csv', vocab_tokenizer)
The printouts are fine. The loaded model object just doesn’t seem to like being called as a tokenizer inside build_vocab. The error given:
Traceback (most recent call last):
  File "scratches/scratch_42.py", line 33, in <module>
    en_vocab=build_vocab('sample.csv', vocab_tokenizer)
  File "scratches/scratch_42.py", line 30, in build_vocab
    counter.update(tokenizer(string_))
TypeError: 'torch._C.ScriptObject' object is not callable
Does that error basically mean I should not be using this generate_sp_model function directly?
https://pytorch.org/text/stable/data_functional.html#torchtext.data.functional.generate_sp_model
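One possible reading of the error, sketched under assumptions: the object returned by load_sp_model is a script object with methods, not a function, so calling it directly raises the TypeError, whereas the generator returned by sentencepiece_tokenizer is callable on an iterable of lines. The sketch below reproduces that shape without torchtext or a trained model. FakeSpModel, pieces_generator and build_counter are hypothetical names made up for illustration, and whitespace splitting only stands in for real subword pieces.

```python
from collections import Counter


class FakeSpModel:
    """Hypothetical stand-in for the loaded model: it has a method but no __call__."""

    def EncodeAsPieces(self, line):
        # Assumption: whitespace splitting stands in for real subword pieces.
        return line.split()


def pieces_generator(sp_model):
    """Mimics the shape of sp_tokens_generator: iterable of lines -> lists of tokens."""
    def _generate(lines):
        for line in lines:
            yield sp_model.EncodeAsPieces(line)
    return _generate


def build_counter(lines, tokens_generator):
    """Count tokens by iterating over the generator instead of calling the model."""
    counter = Counter()
    for tokens in tokens_generator(lines):
        counter.update(tokens)
    return counter


model = FakeSpModel()
lines = ["examples to try!", "examples again"]

# Calling the model object itself fails, much like the ScriptObject in the traceback.
try:
    model(lines[0])
    direct_call_failed = False
except TypeError:
    direct_call_failed = True

# Going through the generator works.
counter = build_counter(lines, pieces_generator(model))
print(direct_call_failed, counter["examples"])
```

If this reading is right, the equivalent change in the script would be to pass sp_tokens_generator to build_vocab instead of vocab_tokenizer, and to loop over sp_tokens_generator(f), updating the counter with each yielded token list.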