Unicode Normalization with TorchScript

:question: Questions and Help

I am writing an AlbertTokenizer-based preprocessor that can be serialized using TorchScript.

One of the preprocessing steps is Unicode normalization (see the reference implementation in Hugging Face's AlbertTokenizer):

```python
outputs = unicodedata.normalize("NFKD", outputs)
```

However, when I try to serialize the preprocessor with this code, I get the following error:

```
Python builtin <built-in function normalize> is currently not supported in Torchscript:
  File "<ipython-input-9-dcefac57d541>", line 53
        outputs = text
        if not self.keep_accents:
            outputs = unicodedata.normalize("NFKD", outputs)
                      ~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    #             outputs = "".join([c for c in outputs if not unicodedata.combining(c)])
        return outputs
```
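For context, here is a minimal plain-Python sketch of what this step computes: NFKD decomposes each character into a base character plus combining marks, and the commented-out line in the traceback then strips the combining marks to remove accents.

```python
import unicodedata

text = "Héllo"
# NFKD splits "é" into "e" followed by the combining acute accent U+0301
decomposed = unicodedata.normalize("NFKD", text)
# Dropping combining characters leaves the unaccented base characters
stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
print(stripped)  # → Hello
```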

What is the recommended way to perform Unicode Normalization while making sure the code is serializable?

The best way to do this is to use the custom operator API to bind in a unicode normalization op.

Thank you @Michael_Suo for your response. Is there any reference example you can share for writing such a custom operator?

This tutorial is the best place to start: https://pytorch.org/tutorials/advanced/torch_script_custom_ops.html. Of course it is not about Unicode normalization; you may have to find a C++ library that implements it (similar to what is done with OpenCV in the tutorial).
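To make that concrete, here is a rough, untested sketch of what the Python side might look like once the tutorial has been followed. Everything here is an assumption: the library name `normalize_op.so`, the op name `my_ops::nfkd_normalize`, and the idea of backing it with a C++ Unicode library such as ICU are all hypothetical, not part of the thread.

```python
import torch

# Hypothetical: load a custom-op shared library built per the tutorial,
# which registers a string op `my_ops::nfkd_normalize` implemented in C++
# (e.g. on top of ICU's normalization APIs).
torch.ops.load_library("normalize_op.so")

class Preprocessor(torch.nn.Module):
    def forward(self, text: str) -> str:
        # The custom op is visible to TorchScript, unlike unicodedata.normalize
        return torch.ops.my_ops.nfkd_normalize(text)

scripted = torch.jit.script(Preprocessor())
```

The key point is that ops registered through the custom operator API are callable from scripted code, so the normalization moves from Python's `unicodedata` into a compiled op the TorchScript compiler can see.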
