What type should the tokenize function in Field return?

dedf21972b190932c756 · May 11, 2019, 1:11pm

Hi, I am new to deep learning.
I have tried to build a sentiment analysis model, but I have run into some problems.
Most examples I have seen define the fields like this:
TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)

But I want to build a Korean version.
I can’t use spaCy, because spaCy doesn’t support Korean.
So here is my question:
How do I write a custom tokenizer?
I don’t know what type it should return: a list? A str?
What return type does data.Field expect from tokenize?
I would appreciate any help.
Thank you!

JamesTrick · May 12, 2019, 10:18pm

Hiya!

I’m not familiar with the Korean language, so this answer will be a bit general.

torchtext lets you pass your own tokenizer function to a Field. Here is a simple example that you can extend:

def tokenizer(text):
    """
    Tokenize a given string of text.
    # Arguments:
        text: (str) String to be tokenized.
    # Returns:
        List of tokens (list of str).
    """
    return text.split()  # Splits on whitespace and returns the list of tokens.

You can then pass it into torchtext using the following:
TEXT = data.Field(tokenize=tokenizer)

To extend it to Korean, you’ll need to find or develop a tokenizer function that works for Korean. It only has to follow the same contract: take a string and return a list of tokens.

Hopefully that helps!

dedf21972b190932c756 · May 14, 2019, 4:55am

Hi,
Thank you for the comments!
I found a Korean tokenizer and applied it.
But I am also curious about how to handle examples where the text is NULL (empty).
I’d like to remove those examples from the training data entirely.
Each example is of type torchtext.data.example.Example.
Do you know how to do that?
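
One possible approach, sketched under assumptions: write a predicate that returns False when an example's text is missing or empty, then keep only the examples that pass. The helper name has_text and the attribute name text are hypothetical (the attribute depends on how the fields were named), and SimpleNamespace stands in for torchtext.data.example.Example so the snippet runs without torchtext installed.

```python
from types import SimpleNamespace


def has_text(example, field_name="text"):
    """Return True if the example has non-empty content in `field_name`."""
    value = getattr(example, field_name, None)
    # bool() is False for None, "" and [] (an empty token list).
    return bool(value)


# Stand-ins for Example objects; real ones would come from the dataset.
examples = [
    SimpleNamespace(text=["좋은", "영화"], label="pos"),
    SimpleNamespace(text=[], label="neg"),
    SimpleNamespace(text=None, label="neg"),
]

# Keep only the examples whose text is present and non-empty.
kept = [ex for ex in examples if has_text(ex)]
print(len(kept))
```

The legacy torchtext dataset constructors appear to accept a predicate like this through a filter_pred argument, but that is worth checking against the installed version before relying on it.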