I want to create an executable (TorchScript) version of the tokenizer for BERT. Below is a small code snippet:
import torch
from transformers import AutoTokenizer, AutoModel

sentences = ['This framework generates embeddings for each input sentence']
tokenizer_model = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2", torchscript=True)
encoded_input = tokenizer_model(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
# !!! complains that 'tokenizer_model' doesn't have eval()
# !!! tokenizer_model takes a list of sentences as input; how should I provide tensor dummy inputs?
traced_tokenizer_model = torch.jit.trace(tokenizer_model, dummy_inputs)
My first problem is that tokenizer_model doesn't have an eval() method. My second problem is that tokenizer_model takes a list of strings as input. How am I supposed to provide dummy inputs in tensor form to create the traced version?
This question seems to be specific to the Hugging Face repository, so could you explain what AutoTokenizer returns? As it doesn't seem to be a plain nn.Module, I guess it might be a custom class? If so, could you check whether this custom object defines the expected eval() and forward() methods?
You will find the solution to your problem at: TorchScript — transformers 3.0.2 documentation
And can you please tell me why you are using TorchScript?
There is no answer to my question in the link you sent. I can take the traced version of Transformers-based Hugging Face models and use it for inference on the C++ side. The problem here is with the tokenizer: how can we encode text on the C++ side, given that no comparable solution is offered in the tokenizers library? No documentation suggests a solution for this. Whenever this issue is raised, both on GitHub and on the Hugging Face forums, irrelevant answers are given and the topic is closed. The developers of both organizations ignore this issue.
I wouldn't claim this issue is ignored on our side, as I've asked for the definition of
AutoTokenizer as well as its usage without any follow-up (maybe the issue was already solved by the author?).
If you are hitting the same issue, feel free to post a minimal, executable code snippet and provide the necessary information that was previously requested.
Yeah this is actually a big practical issue for productionizing Huggingface models.
A little background:
- Huggingface is a model library that contains implementations of many tokenizers and transformer architectures, as well as a simple API for loading many public pretrained transformers with these architectures, and it supports both TensorFlow and PyTorch versions of many of these models.
- A “model” must be paired with a “tokenizer”, where the tokenizer typically takes in python strings and outputs tensors, and the model takes in the tensors output by the tokenizer and outputs tensors according to its transformer’s architecture. If a different tokenizer is used than the model was originally trained with, the model will have undefined performance and likely not produce meaningful results at all.
- The pairing from tokenizer to model is not explicit, but is managed through the Auto* api loading from the same external path.
- AutoTokenizer and AutoModel are not real classes and don't need to be traced or compatible with TorchScript; they're an API for loading pretrained tokenizer+model pairs, so the output type of
AutoTokenizer.from_pretrained will depend on the serialized data of the pretrained model you're loading, but will never inherit from AutoTokenizer. The actual implementations are described here; for instance, a commonly used implementation is PreTrainedTokenizerFast.
- Huggingface supports TorchScript export for Models (specifically some subset of their model library that has supported Torch implementations), but does not support TorchScript export for their tokenizers.
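As a quick check of the point above (a small sketch, assuming network access to fetch the tokenizer files for bert-base-uncased), the object AutoTokenizer.from_pretrained returns is a concrete tokenizer class, not an nn.Module, which is why it has no eval() or forward() to trace:

```python
import torch
from transformers import AutoTokenizer

# AutoTokenizer is only a loader; the returned object is a concrete class
# such as BertTokenizerFast (a PreTrainedTokenizerFast subclass).
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tok).__name__)                # e.g. BertTokenizerFast
print(isinstance(tok, torch.nn.Module))  # False: nothing here for torch.jit to trace
```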
Since the tokenizers are paired 1:1 with a model and the tokenizer can’t be serialized with TorchScript, the benefit of serializing the model with TorchScript is extremely limited; any production scenario is still going to have to identify the correct tokenizer implementation for a specific serialized model, run a full python runtime and load in that tokenizer from a separate resource, and run tokenization in that python process to produce tensors compatible with the model.
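The split can be sketched with a stand-in pair (ToyModel and toy_tokenize are invented for illustration; they only mimic the tensor→tensor and string→tensor roles): torch.jit.trace records tensor operations, so only the model half can be serialized, while the string-handling half must stay in a Python process.

```python
import torch

class ToyModel(torch.nn.Module):
    """Stand-in 'model': tensor -> tensor, so it can be traced."""
    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return input_ids.float().mean(dim=-1)

def toy_tokenize(sentences):
    """Stand-in 'tokenizer': str -> tensor; pure Python, not traceable."""
    # pad/truncate every sentence to 8 code points (mimics padding/max_length)
    return torch.tensor([[ord(c) for c in s.ljust(8)[:8]] for s in sentences])

# Trace the model half with example tensors produced by the tokenizer half.
traced = torch.jit.trace(ToyModel(), toy_tokenize(["example"]))
traced.save("toy_model.pt")  # this artifact is loadable from C++ via torch::jit::load

# But serving still needs the Python tokenizer in front of the traced model:
out = traced(toy_tokenize(["hello world", "another"]))
print(out.shape)  # torch.Size([2])
```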
The concrete request here is to expand TorchScript’s python language feature coverage to support the main tokenizer implementations in Huggingface, and work with Huggingface to make sure this support is maintained as they continue to develop their library. In order to practically and sustainably productionize popular pretrained text models, TorchScript and Huggingface together should support an API something like
def __init__(self, path):
    self.tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(path)
    self.model: BertModel = AutoModel.from_pretrained(path)

def sentence_embeddings(self, sentences: list[str]) -> torch.Tensor:
    tokens = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    return self.model(**tokens.to(self.model.device), return_dict=True).last_hidden_state[:, 0].cpu()
Reading back through the thread I noticed your request for a "minimal, executable code snippet", so I'm following up to add one:
>>> import transformers
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')
>>> import torch
>>> torch.jit.script(tokenizer)
Traceback (most recent call last):
<...snipped traceback for compactness>
torch.jit.frontend.NotSupportedError: Compiled functions can't take variable number of arguments or use keyword-only arguments with defaults: <...snipped>