Just here to share a tool: `fast-mosestokenizer`

TL;DR

pip install fast-mosestokenizer

for a fast Moses tokenizer.

Hey all,

Since the people here all dabble with NLP, I hope to share a tool that you might find useful.

The Moses tokenizer is a really popular tokenizer for many languages, and it gets used in research, business, and personal projects.

I wrapped the C++ tokenizer from the Moses decoder package into a standalone library called fast-mosestokenizer: https://github.com/mingruimingrui/fast-mosestokenizer.
The biggest benefit of using it is probably cross-language compatibility. The package provides a Python interface, which is published on PyPI.

Feel free to download and give it a try.

pip install fast-mosestokenizer
>>> import mosestokenizer
>>> tokenizer = mosestokenizer.MosesTokenizer(
    lang='en',
    aggressive_dash_splits=True,
    refined_punct_splits=True,
)
>>> tokenizer.tokenize("""
And you'd be right — in both cases.
But en can also be translated as "at,"
"about," "by," "on top of," "upon,"
"inside of" and other ways,
so it's use isn't as straightforward as it may appear.
""")
[
  'And', 'you', "'d", 'be', 'right', '—', 'in', 'both', 'cases', '.',
  'But', 'en', 'can', 'also', 'be', 'translated', 'as', '"', 'at', ',', '"',
  '"', 'about', ',', '"', '"', 'by', ',', '"', '"', 'on', 'top', 'of', ',', '"', '"', 'upon', ',', '"',
  '"', 'inside', 'of', '"', 'and', 'other', 'ways', ',',
  'so', 'it', "'s", 'use', 'is', "n't", 'as', 'straightforward', 'as', 'it', 'may', 'appear', '.'
]

The package is still in beta. Early adopters will help me a lot in developing it further.


Thanks for sharing!
CC @Thomas_Wolf who might be interested in this. 🙂

@mingruimingrui thanks for sharing. May I ask how the tokenizer behaves in some example cases below:

  • How are user mentions (e.g., @DemoUser) or hashtags (e.g., #helloworld) tokenized?
  • Does “This is…a test” get correctly tokenized?
  • Does “So much fun🙂Good night!” get correctly tokenized (please assume no whitespace before and after the emoji)?
  • Does “Hi :-p” get correctly tokenized, i.e., does the emoticon remain the token “:-p”?

I work a lot with social media, and missing whitespace, emojis as sentence separators, etc. are very common. Most off-the-shelf tokenizers struggle with most of these cases. And even some that, for example, recognize common emoticons such as “:-)” struggle with variations like “:-)))”.

This is why I had to bend over backwards to write my own tokenizer that covers most common cases, but I’m happy to try others.
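For what it's worth, variations like “:-)))” can be covered with a slightly greedy regular expression. A minimal sketch in plain Python (the pattern below is my own toy example, not part of any tokenizer mentioned in this thread):

```python
import re

# Toy emoticon pattern: an "eyes" character, an optional nose, then one
# or more "mouth" characters, so repeated mouths like ":-)))" also match.
EMOTICON_RE = re.compile(r"[:;=]-?[)(DPp]+")

for text in [":-)", ":-)))", ";-(", "Hi :-p"]:
    m = EMOTICON_RE.search(text)
    print(text, "->", m.group(0) if m else None)
```

A real pattern would of course need a much larger inventory of eyes, noses, and mouths, but the repetition trick is the same.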

Let me try it out for you.

>>> import mosestokenizer

>>> tokenizer = mosestokenizer.MosesTokenizer(
    lang='en',
    aggressive_dash_splits=True,
    refined_punct_splits=True,
)

>>> cases = [
    '@DemoUser',
    '#helloworld',
    'This is…a test',
    'This is...a test',
    'So much fun🙂Goodnight!',
    'Hi :-p',
]

>>> for case in cases:
...     tokens = tokenizer.tokenize(case)
...     print('{} -> {} -> {}'.format(
...         case, tokens, tokenizer.detokenize(tokens)))
@DemoUser -> ['@', 'DemoUser'] -> @ DemoUser
#helloworld -> ['#', 'helloworld'] -> # helloworld
This is…a test -> ['This', 'is', '…', 'a', 'test'] -> This is … a test
This is...a test -> ['This', 'is', '...', 'a', 'test'] -> This is... a test
So much fun🙂Goodnight! -> ['So', 'much', 'fun', '🙂', 'Goodnight', '!'] -> So much fun 🙂 Goodnight!
Hi :-p -> ['Hi', ':', '@-@', 'p'] -> Hi:-p

Emoticon protection is probably out of scope for this package.
There are also better ways to account for emoticons and special patterns.
If you are looking for an out-of-the-box solution that doesn't require any prior training
or fine-tuning, https://github.com/cbaziotis/ekphrasis might be worth a look.

The main goal of this tokenizer is to produce the same tokenization as the original Moses implementation (plus some slight adjustments).

Note that you can also provide protected regex patterns.

Using,

#protected_pattern.en
:-p

You get,

>>> tokens = tokenizer.tokenize('Hi :-p')
>>> print(tokens)
["Hi", ":-p"]
>>> tokenizer.detokenize(tokens)
"Hi :-p"