Unicode Normalization with TorchScript

:question: Questions and Help

I am writing an AlbertTokenizer-based preprocessor that can be serialized using TorchScript.

One of the preprocessing steps is Unicode normalization (see the reference implementation in Hugging Face's AlbertTokenizer):

```python
outputs = unicodedata.normalize("NFKD", outputs)
```

However, when I try to serialize the preprocessor with this code, I get the following error:

```
Python builtin <built-in function normalize> is currently not supported in Torchscript:
  File "<ipython-input-9-dcefac57d541>", line 53
        outputs = text
        if not self.keep_accents:
            outputs = unicodedata.normalize("NFKD", outputs)
                      ~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    #             outputs = "".join([c for c in outputs if not unicodedata.combining(c)])
        return outputs
```
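For context, here is a minimal plain-Python sketch of what this step computes: NFKD decomposes each character into a base character plus combining marks, and the commented-out line in the traceback then strips the combining marks to remove accents.

```python
import unicodedata

text = "Héllo"
# NFKD splits "é" into "e" followed by the combining acute accent U+0301
decomposed = unicodedata.normalize("NFKD", text)
# Dropping combining characters leaves the unaccented base characters
stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
print(stripped)  # → Hello
```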

What is the recommended way to perform Unicode Normalization while making sure the code is serializable?

The best way to do this is to use the custom operator API to bind in a unicode normalization op.

Thank you @Michael_Suo for your response. Is there any reference example you can share for writing such a custom operator?

This tutorial is the best place to start: https://pytorch.org/tutorials/advanced/torch_script_custom_ops.html. Of course it is not about Unicode normalization; you may have to find a C++ library that implements it (similar to what is done with OpenCV in the tutorial).
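To make that concrete, here is a rough, untested sketch of what the Python side might look like once the tutorial has been followed. Everything here is an assumption: the library name `normalize_op.so`, the op name `my_ops::nfkd_normalize`, and the idea of backing it with a C++ Unicode library such as ICU are all hypothetical, not part of the thread.

```python
import torch

# Hypothetical: load a custom-op shared library built per the tutorial,
# which registers a string op `my_ops::nfkd_normalize` implemented in C++
# (e.g. on top of ICU's normalization APIs).
torch.ops.load_library("normalize_op.so")

class Preprocessor(torch.nn.Module):
    def forward(self, text: str) -> str:
        # The custom op is visible to TorchScript, unlike unicodedata.normalize
        return torch.ops.my_ops.nfkd_normalize(text)

scripted = torch.jit.script(Preprocessor())
```

The key point is that ops registered through the custom operator API are callable from scripted code, so the normalization moves from Python's `unicodedata` into a compiled op the TorchScript compiler can see.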
