Hi, I am new to deep learning.
I have tried to build a sentiment analysis model, but I have run into some problems.
Most examples I have seen define the fields like this:
TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)
However, I want to build a Korean version.
I can’t use spaCy because, as far as I can tell, it doesn’t support Korean.
So here is my question:
How do I write a custom tokenizer?
I don’t know what type it should return: a list? A str?
What return type does data.Field expect from tokenize?
Please answer.
Thank you!
I’m not familiar with the Korean language, so this answer will be a bit general.
torchtext lets you pass your own tokenizer to data.Field. The tokenize argument accepts any callable that takes a string and returns a list of string tokens. Here is a simple but extensible example:
def tokenizer(text):
    """
    Function to tokenize given string of text.
    # Arguments:
        text: (str) String to be tokenized.
    # Returns:
        List of tokens.
    """
    return text.split()  # Splits on whitespace and returns the list of tokens.
You can then pass it to torchtext like this:
TEXT = data.Field(tokenize=tokenizer)
To extend it to Korean, you’ll need to find or develop a tokenizer function that works for Korean and has the same shape: it takes a str and returns a list of str tokens.
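As a rough illustration of that shape, here is a minimal sketch of a tokenizer that handles Hangul text using only the standard library. The name korean_tokenizer and the regular expression are my own assumptions for illustration; it only splits runs of characters and does no morphological analysis, so a dedicated Korean tokenizer (for example one from the KoNLPy package) would likely give better tokens.

```python
import re

# Assumption: a token is a run of Hangul syllables (U+AC00 to U+D7A3),
# a run of Latin letters/digits, or any other single non-space character.
_TOKEN_RE = re.compile(r"[\uac00-\ud7a3]+|[A-Za-z0-9]+|[^\s\uac00-\ud7a3A-Za-z0-9]")


def korean_tokenizer(text):
    """Hypothetical stand-in tokenizer: takes a str, returns a list of str tokens."""
    return _TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(korean_tokenizer("나는 학교에 간다."))

# It would be passed to torchtext the same way as the whitespace tokenizer:
# TEXT = data.Field(tokenize=korean_tokenizer)
```

The important part is the contract rather than the regex: whatever tokenizer you plug in should accept one string and return a list of strings.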
Hi ~
Thank you for the comments!!!
I found a Korean tokenizer and applied it ^^
But I am also curious about how to handle examples whose text is NULL.
I’d like to remove those examples from the training data entirely.
Each one is of type torchtext.data.example.Example.
Do you know how to do that?
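One possible approach, sketched under assumptions: keep only the examples whose text field is non-empty. The Example class below is a hypothetical stand-in so the snippet runs without torchtext; it assumes the text field is stored as an attribute named text, which depends on how your fields are named. The helper has_text is likewise my own name, not a torchtext function.

```python
class Example:
    """Hypothetical stand-in for torchtext.data.example.Example.

    Assumes each field is stored as an attribute (here: text and label).
    """

    def __init__(self, text, label):
        self.text = text
        self.label = label


def has_text(example):
    """Return True if the example has a non-empty text attribute."""
    return bool(getattr(example, "text", None))


examples = [
    Example(["좋아요"], "pos"),  # normal example
    Example([], "neg"),          # tokenized to an empty list
    Example(None, "pos"),        # missing text
]

# Keep only the examples that actually contain tokens.
kept = [ex for ex in examples if has_text(ex)]

if __name__ == "__main__":
    print(len(kept))

# With the legacy torchtext API, the same predicate could probably be applied as
#   train_data.examples = [ex for ex in train_data.examples if has_text(ex)]
# or passed as filter_pred=has_text when the dataset is constructed.
```

Filtering after tokenization also catches rows that were not NULL in the raw file but became empty once tokenized.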