I am writing an AlbertTokenizer-based preprocessor that can be serialized using TorchScript.
One of the preprocessing steps is Unicode normalization (see the reference implementation in HF's AlbertTokenizer):
outputs = unicodedata.normalize("NFKD", outputs)
However, when I try to serialize the preprocessor with this code, I get the following error:
Python builtin <built-in function normalize> is currently not supported in Torchscript:
  File "<ipython-input-9-dcefac57d541>", line 53
        outputs = text
        if not self.keep_accents:
            outputs = unicodedata.normalize("NFKD", outputs)
                      ~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            # outputs = "".join([c for c in outputs if not unicodedata.combining(c)])
        return outputs
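For reference, the step itself works fine in plain (eager) Python. Here is a minimal standalone sketch of what it does, outside of any TorchScript module. The function name `normalize_text` and the `keep_accents` argument are just stand-ins for the method and attribute in my class, and the accent-stripping line that is commented out in my code is left out here.

```python
import unicodedata


def normalize_text(text: str, keep_accents: bool = False) -> str:
    # Hypothetical standalone version of the failing method; in the real
    # preprocessor this logic lives in a class and reads self.keep_accents.
    outputs = text
    if not keep_accents:
        # NFKD = compatibility decomposition: ligatures are expanded and
        # accented characters are split into base character + combining mark.
        outputs = unicodedata.normalize("NFKD", outputs)
    return outputs


if __name__ == "__main__":
    # The single-codepoint ligature U+FB01 decomposes into the two letters "fi".
    print(normalize_text("\ufb01"))
    # U+00E9 (e with acute) decomposes into "e" followed by U+0301.
    print([hex(ord(c)) for c in normalize_text("\u00e9")])
```

This runs without problems in eager mode; the failure only appears once the method is compiled with TorchScript.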
What is the recommended way to perform Unicode Normalization while making sure the code is serializable?