Fast mosestokenizer
TLDR
pip install fast-mosestokenizer
For a fast Moses tokenizer.
I’d like to share a tool
Hey all,
Since most people here work with NLP in some form, I'd like to share a tool that you will hopefully find useful.
Moses is a really popular tokenizer for many languages and is used in research, business, and personal projects.
I wrapped the C++ tokenizer from the mosesdecoder package into a standalone library called fast-mosestokenizer:
https://github.com/mingruimingrui/fast-mosestokenizer
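The biggest benefit of using it is probably cross-language compatibility: because the core is a standalone C++ library, bindings for other languages can be built on top of it. The package already includes a Python interface, which is published on PyPI.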
Feel free to download and give it a try.
pip install fast-mosestokenizer
>>> import mosestokenizer
>>> tokenizer = mosestokenizer.MosesTokenizer(
    aggressive_dash_splits=True,
    refined_punct_splits=True,
)
>>> tokenizer.tokenize("""
And you'd be right — in both cases.
But en can also be translated as "at,"
"about," "by," "on top of," "upon,"
"inside of" and other ways,
so it's use isn't as straightforward as it may appear.
""")
[
'And', 'you', "'d", 'be', 'right', '—', 'in', 'both', 'cases', '.',
'But', 'en', 'can', 'also', 'be', 'translated', 'as', '"', 'at', ',', '"',
'"', 'about', ',', '"', '"', 'by', ',', '"', '"', 'on', 'top', 'of', ',', '"', '"', 'upon', ',', '"',
'"', 'inside', 'of', '"', 'and', 'other', 'ways', ',',
'so', 'it', "'s", 'use', 'is', "n't", 'as', 'straightforward', 'as', 'it', 'may', 'appear', '.'
]
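If you want to run this over a whole file rather than a single string, here is a minimal sketch of how that might look. It only relies on the constructor arguments and the tokenize() call shown above; the helper names (load_tokenizer, tokenize_lines) and the WhitespaceTokenizer stand-in are my own illustrative assumptions, not part of the package, and the fallback exists only so the snippet still runs when fast-mosestokenizer is not installed.

```python
from typing import Iterable, List


class WhitespaceTokenizer:
    """Hypothetical stand-in, used only if fast-mosestokenizer is unavailable."""

    def tokenize(self, text: str) -> List[str]:
        return text.split()


def load_tokenizer():
    """Return a Moses tokenizer if it can be created, else the stand-in."""
    try:
        import mosestokenizer  # provided by `pip install fast-mosestokenizer`

        # Same options as in the session above.
        return mosestokenizer.MosesTokenizer(
            aggressive_dash_splits=True,
            refined_punct_splits=True,
        )
    except Exception:
        # Package missing, or a different package with the same module name.
        return WhitespaceTokenizer()


def tokenize_lines(tokenizer, lines: Iterable[str]) -> List[str]:
    """Tokenize each non-empty line and join its tokens with single spaces."""
    result = []
    for line in lines:
        line = line.strip()
        if line:
            result.append(" ".join(tokenizer.tokenize(line)))
    return result


if __name__ == "__main__":
    tok = load_tokenizer()
    for tokenized in tokenize_lines(tok, ["Hello, world!", "", "It isn't hard."]):
        print(tokenized)
```

Space-joined tokens, one sentence per line, is the usual format that downstream Moses-style tools expect, which is why the helper returns strings rather than token lists.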
The package is still in beta. Feedback from early adopters will help me a lot with further development.