Questions and Help
I am writing an AlbertTokenizer-based preprocessor that can be serialized with TorchScript.
One of the preprocessing steps is Unicode normalization (see the reference implementation in HF's AlbertTokenizer):
outputs = unicodedata.normalize("NFKD", outputs)
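For context, here is a minimal eager-mode sketch of what this step does, using only the standard library. The function name `strip_accents` is hypothetical, and the sketch enables the combining-character filter that is commented out in my snippet further down, so treat it as an illustration of the intended behavior rather than the exact code I am scripting. It is plain Python and is not TorchScript-compatible.

```python
import unicodedata


def strip_accents(text: str, keep_accents: bool = False) -> str:
    """Eager-mode sketch of the normalization step (hypothetical helper)."""
    outputs = text
    if not keep_accents:
        # NFKD decomposes each accented character into a base character
        # followed by one or more combining marks.
        outputs = unicodedata.normalize("NFKD", outputs)
        # Dropping the combining marks leaves only the base characters.
        outputs = "".join(c for c in outputs if not unicodedata.combining(c))
    return outputs


print(strip_accents("café"))
print(strip_accents("café", keep_accents=True))
```

With `keep_accents=True` the text is returned unchanged, which matches the `if not self.keep_accents:` guard in the traceback.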
However, when I try to serialize the preprocessor with this code, I get the following error:
Python builtin <built-in function normalize> is currently not supported in Torchscript:
  File "<ipython-input-9-dcefac57d541>", line 53
        outputs = text
        if not self.keep_accents:
            outputs = unicodedata.normalize("NFKD", outputs)
                      ~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            # outputs = "".join([c for c in outputs if not unicodedata.combining(c)])
        return outputs
What is the recommended way to perform Unicode Normalization while making sure the code is serializable?