Using PyTorch SentencePiece to go from Indices to Text

Hello! I’m using PyTorch’s SentencePiece model (via torchtext) again, which is far more efficient than any other tokenizer I’ve come across. However, I’m having difficulty getting back from indices to tokens (text) using the model. Here is a wish list; any one of these would solve the issue for me:

  1. A built-in command to convert from indices to tokens, similar to vocab.get_itos()[index] (see here);
  2. A reverse method of sentencepiece_numericalizer;
  3. A method to extract from the saved SP model a dictionary containing the tokens: indices, or vice versa.


generate_sp_model creates two files, one ending in .model and the other in .vocab. I’m using the following workaround for now:

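from torchtext.vocab import build_vocab_from_iterator

# Each line of the .vocab file is "<token>\t<score>"; read them all in.
with open("m_user.vocab", "r", encoding="utf-8") as filestr:
    x = filestr.readlines()

# Keep only the token, i.e. everything before the first tab.
for k in range(len(x)):
    x[k] = x[k].split("\t")[0]

def yield_tokens(tokenlist):
    # build_vocab_from_iterator expects an iterable of token lists.
    for token in tokenlist:
        yield token.strip().split()

vocab = build_vocab_from_iterator(yield_tokens(x))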


And then I can call vocab.get_itos()[index] or vocab.get_stoi()[text] accordingly. I’m still looking for an optimal solution, but this gets the job done.
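
For item 3 on the wish list, here is a minimal sketch that skips torchtext's vocab entirely and reads the .vocab file into a plain list and dict. It rests on two assumptions that should be checked against your own files: each line has the form "<token>\t<score>", and the position of a line equals the index the model assigns to that token. The second point matters because, as far as I can tell, build_vocab_from_iterator orders tokens by frequency, so its indices may not line up with the ones sentencepiece_numericalizer produces. The names parse_sp_vocab, ids_to_text and sample_lines are made up for this example, and the sample scores are invented.

```python
def parse_sp_vocab(lines):
    """Build index->token and token->index maps from .vocab lines.

    Assumes each line is "<token>\t<score>" and that the line position
    equals the token's index in the SentencePiece model.
    """
    itos = [line.rstrip("\n").split("\t")[0] for line in lines if line.strip()]
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi


def ids_to_text(ids, itos):
    """Join the pieces for `ids` and turn the U+2581 word marker into spaces."""
    pieces = [itos[i] for i in ids]
    return "".join(pieces).replace("\u2581", " ").strip()


# Tiny in-memory stand-in for the contents of m_user.vocab (made-up scores).
sample_lines = [
    "<unk>\t0\n",
    "<s>\t0\n",
    "</s>\t0\n",
    "\u2581hello\t-1.5\n",
    "\u2581wor\t-2.0\n",
    "ld\t-2.5\n",
]

itos, stoi = parse_sp_vocab(sample_lines)
print(stoi["ld"])                    # 5
print(ids_to_text([3, 4, 5], itos))  # hello world
```

With the real file, the same function can be fed the open file object, e.g. open("m_user.vocab", encoding="utf-8"), since iterating over it yields the lines. If the standalone sentencepiece package is installed, its SentencePieceProcessor loaded from the .model file also offers id_to_piece and decode, which may be the more direct route.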