Create a TorchScript version of the tokenizer in BERT

I want to create an executable (TorchScript) version of the tokenizer for BERT. Below is a small code snippet:

from transformers import AutoTokenizer, AutoModel
import torch

sentences = ['This framework generates embeddings for each input sentence']
tokenizer_model = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2", torchscript=True)
encoded_input = tokenizer_model(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')

# !!! complains that 'tokenizer_model' doesn't have eval()
tokenizer_model.eval();    

# !!! tokenizer_model takes a list of sentences as input, how should I provide tensor dummy inputs?
traced_tokenizer_model = torch.jit.trace(tokenizer_model, dummy_inputs)
torch.jit.save(traced_tokenizer_model, "traced_tokenize_bert.pt")

My first problem is that tokenizer_model doesn’t have eval().

My second problem is that tokenizer_model takes a list of strings as input. How am I supposed to provide dummy inputs in tensor form to create traced_tokenizer_model?


This question seems to be specific to the HuggingFace repository, so could you explain what AutoTokenizer returns? As it doesn’t seem to be a plain nn.Module, I guess it might be a custom class? If so, could you check if this custom object defines the expected __init__ and forward methods?

You will find the solution to your problem at TorchScript — transformers 3.0.2 documentation.
Also, can you please tell me why you are using TorchScript?
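
For reference, the approach described there is to trace the model rather than the tokenizer: the tokenizer stays in Python and is only used to produce the tensor inputs that torch.jit.trace needs. Roughly something like this (a sketch adapted to the snippet above; the output filename is just an example):

from transformers import AutoTokenizer, AutoModel
import torch

model_name = "sentence-transformers/paraphrase-mpnet-base-v2"

# The tokenizer is a plain Python object; it is not traced and has no eval().
tokenizer = AutoTokenizer.from_pretrained(model_name)
# torchscript=True belongs on the model; it makes the model return tuples instead of dicts.
model = AutoModel.from_pretrained(model_name, torchscript=True)
model.eval()

encoded = tokenizer(
    ["This framework generates embeddings for each input sentence"],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)

# The tensors produced by the tokenizer serve as the dummy inputs for tracing.
traced_model = torch.jit.trace(model, (encoded["input_ids"], encoded["attention_mask"]))
torch.jit.save(traced_model, "traced_mpnet_bert.pt")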

There is no answer to the question in the link you sent. I can take the traced version of a Transformers-based Hugging Face model and use it for inference on the C++ side. But the problem here is with the tokenizer: how can we encode text on the C++ side, when no similar solution is offered for the tokenizers? No documentation suggests a solution for this. When this issue is raised, both on GitHub and on the Hugging Face forums, irrelevant answers are given and the topic is closed. The developers of both organizations ignore this issue.

I wouldn’t claim this issue is ignored on our side, as I asked for the definition of AutoTokenizer as well as its usage without getting any follow-up (maybe the issue was already solved by the author?).
If you are hitting the same issue, feel free to post a minimal, executable code snippet and provide the information that was previously requested.

Yeah this is actually a big practical issue for productionizing Huggingface models.

A little background:

  • Huggingface is a model library that contains implementations of many tokenizers and transformer architectures, as well as a simple API for loading many public pretrained transformers with these architectures, and supports both TensorFlow and PyTorch versions of many of these models.
  • A “model” must be paired with a “tokenizer”, where the tokenizer typically takes in python strings and outputs tensors, and the model takes in the tensors output by the tokenizer and outputs tensors according to its transformer’s architecture. If a different tokenizer is used than the model was originally trained with, the model will have undefined performance and likely not produce meaningful results at all.
  • The pairing from tokenizer to model is not explicit, but is managed through the Auto* api loading from the same external path.
  • AutoTokenizer and AutoModel are not real classes and don’t need to be traced or compatible with TorchScript; they’re an API for loading in pretrained tokenizer+model pairs, and so the output type of AutoTokenizer.from_pretrained will depend on the serialized data of the pretrained model you’re loading, but never inherit from AutoTokenizer. The actual implementations are described here and for instance a commonly used implementation is PreTrainedTokenizerFast.
  • Huggingface supports TorchScript export for Models (specifically some subset of their model library that has supported Torch implementations), but does not support TorchScript export for their tokenizers.

Since the tokenizers are paired 1:1 with a model and the tokenizer can’t be serialized with TorchScript, the benefit of serializing the model with TorchScript is extremely limited; any production scenario is still going to have to identify the correct tokenizer implementation for a specific serialized model, run a full python runtime and load in that tokenizer from a separate resource, and run tokenization in that python process to produce tensors compatible with the model.
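
To make that concrete, a deployment today still looks roughly like the sketch below (the checkpoint name and file path are just placeholders): a Python process loads the matching tokenizer, runs it on the incoming text, and only then can the TorchScript model artifact be invoked.

import torch
from transformers import AutoTokenizer

# Python-side tokenization is unavoidable: the TorchScript artifact only
# accepts the tensors that this tokenizer produces.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
traced_model = torch.jit.load("traced_model.pt")

encoded = tokenizer(["some incoming request text"], padding=True,
                    truncation=True, return_tensors="pt")
outputs = traced_model(encoded["input_ids"], encoded["attention_mask"])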

The concrete request here is to expand TorchScript’s python language feature coverage to support the main tokenizer implementations in Huggingface, and work with Huggingface to make sure this support is maintained as they continue to develop their library. In order to practically and sustainably productionize popular pretrained text models, TorchScript and Huggingface together should support an API something like

import torch
from transformers import AutoModel, AutoTokenizer, BertModel, PreTrainedTokenizerFast

class TextTransformer:
    def __init__(self, path):
        self.tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(path)
        self.model: BertModel = AutoModel.from_pretrained(path)

    def sentence_embeddings(self, sentences: list[str]) -> torch.Tensor:
        # Tokenize, run the model, and return the [CLS] embedding per sentence.
        tokens = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        return self.model(**tokens.to(self.model.device), return_dict=True).last_hidden_state[:, 0].cpu()

torch.jit.script(TextTransformer("s3://path-to-model")).save("serialized_text_model.pt")

Reading back through the thread I noticed your request for a “minimal, executable code snippet”, so following up to add one 🙂

>>> import transformers
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
>>> type(tokenizer)
<class 'transformers.models.bart.tokenization_bart_fast.BartTokenizerFast'>
>>> import torch
>>> torch.jit.script(tokenizer)
Traceback (most recent call last):
<...snipped traceback for compactness>
torch.jit.frontend.NotSupportedError: Compiled functions can't take variable number of arguments or use keyword-only arguments with defaults: <...snipped>
>>>

Has anyone found a solution to this problem? I’m trying to run the Hugging Face Tapas tokenizer in Java for my Android app, but I can’t get the tokenizer to work in Java without rewriting all the Python code from the ground up. I’m using Deep Java Library, but they only support basic tokenization such as WordPiece. If there were a way to run the tokenizer and model as machine code, or possibly to run the entire Colab Python code, it could work.

Any luck here? I have found a PyTorch Lite demo project on GitHub (Question Answering with DistilBERT) with a Kotlin function for tokenizing strings (pasted below). I haven’t experimented with it yet to test its suitability with other models. I don’t know how specific these outputs are to their associated models, so this may not be helpful.

    @Throws(QAException::class)
    private fun tokenizer(question: String, text: String): LongArray {
        // Build the BERT-style input id sequence: [CLS] question [SEP] text [SEP]
        val tokenIdsQuestion = wordPieceTokenizer(question)
        if (tokenIdsQuestion.size >= MODEL_INPUT_LENGTH) throw QAException("Question too long")
        val tokenIdsText = wordPieceTokenizer(text)
        val inputLength = tokenIdsQuestion.size + tokenIdsText.size + EXTRA_ID_NUM
        val ids = LongArray(Math.min(MODEL_INPUT_LENGTH, inputLength))
        ids[0] = mTokenIdMap!![CLS]!!

        // Copy the question ids, then the first [SEP]
        for (i in tokenIdsQuestion.indices) ids[i + 1] = tokenIdsQuestion[i]!!.toLong()
        ids[tokenIdsQuestion.size + 1] = mTokenIdMap!![SEP]!!
        // Truncate the text ids so question + text + special tokens fit within MODEL_INPUT_LENGTH
        val maxTextLength = Math.min(tokenIdsText.size, MODEL_INPUT_LENGTH - tokenIdsQuestion.size - EXTRA_ID_NUM)

        for (i in 0 until maxTextLength) {
            ids[tokenIdsQuestion.size + i + 2] = tokenIdsText[i]!!.toLong()
        }

        // Closing [SEP]
        ids[tokenIdsQuestion.size + maxTextLength + 2] = mTokenIdMap!![SEP]!!
        return ids
    }

    private fun wordPieceTokenizer(questionOrText: String): List<Long?> {
        // for each token, if it's in the vocab.txt (a key in mTokenIdMap), return its Id
        // else do: a. find the largest sub-token (at least the first letter) that exists in vocab;
        // b. add "##" to the rest (even if the rest is a valid token) and get the largest sub-token "##..." that exists in vocab;
        // and c. repeat b.
        val tokenIds: MutableList<Long?> = ArrayList()
        val p = Pattern.compile("\\w+|\\S")
        val m = p.matcher(questionOrText)
        while (m.find()) {
            val token = m.group().toLowerCase()
            if (mTokenIdMap!!.containsKey(token)) tokenIds.add(mTokenIdMap!![token]) else {
                for (i in 0 until token.length) {
                    if (mTokenIdMap!!.containsKey(token.substring(0, token.length - i - 1))) {
                        tokenIds.add(mTokenIdMap!![token.substring(0, token.length - i - 1)])
                        var subToken = token.substring(token.length - i - 1)
                        var j = 0

                        while (j < subToken.length) {
                            if (mTokenIdMap!!.containsKey("##" + subToken.substring(0, subToken.length - j))) {
                                tokenIds.add(mTokenIdMap!!["##" + subToken.substring(0, subToken.length - j)])
                                subToken = subToken.substring(subToken.length - j)
                                j = subToken.length - j
                            } else if (j == subToken.length - 1) {
                                tokenIds.add(mTokenIdMap!!["##$subToken"])
                                break
                            } else j++
                        }
                        break
                    }
                }
            }
        }
        return tokenIds
    }
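
For what it’s worth, one way to check how model-specific those outputs are is to compare the ids from a hand-rolled tokenizer like the one above against the Hugging Face tokenizer for the same checkpoint in Python. A rough sketch (the checkpoint name and strings are just examples, and this only applies to plain WordPiece-style tokenizers, not something like Tapas):

from transformers import AutoTokenizer

# Use the exact checkpoint your mobile model was exported from.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-distilled-squad")

question = "Who wrote the report?"
text = "The report was written by the engineering team in 2020."

# input_ids should match the LongArray produced by the Kotlin tokenizer() above
# (same vocab.txt, same [CLS]/[SEP] layout) if the port is faithful.
reference_ids = tokenizer(question, text)["input_ids"]
print(reference_ids)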