Arabic text processing

Hello friend,

I have a question about how to process Arabic text from right to left in seq2seq model tasks (word embedding, softmax function). I think CNNs and RNNs by default process text left to right, like English, and this may negatively affect the accuracy of the results.

Hi,

Actually, I have never done a project on text, but this Medium article may help you, because they built a Persian-English translator. As you may know, Persian and Arabic use the same script (right to left, and the same characters except for four).

Good luck


Thank you, Mr. Lakhani, for your help :slight_smile:

I read the article carefully, and there is no mention of right-to-left text processing. The article uses word or sentence alignment, and that is enough for that case.

I have a new question for you. :slight_smile:
Do you have an idea about right-to-left seq2seq models in PyTorch?

I cannot see why there should be any fundamental differences resulting from the reading direction of a language. Nothing truly changes.

RNNs process sequences. It doesn’t matter whether the first item of a sequence is the leftmost or the rightmost word of a sentence. Diagrams of RNNs commonly show them processing left to right. Well, just flip it. An RNN has no reading direction :). If you use a bidirectional RNN, then it’s a non-issue to begin with.
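For example, a toy sketch in PyTorch (all sizes and the vocabulary are made up): feeding a right-to-left reading is just flipping the time axis, and a bidirectional RNN reads both directions anyway.

```python
import torch
import torch.nn as nn

# Toy batch of already-numericalized token ids: (batch, seq_len)
token_ids = torch.randint(0, 1000, (2, 7))

# "Right-to-left" processing is just the same batch with the time axis flipped
# (with padded batches you would flip only the non-pad positions).
reversed_ids = torch.flip(token_ids, dims=[1])

embedding = nn.Embedding(1000, 32)
# A bidirectional GRU reads the sequence in both directions anyway,
# so the reading direction of the language becomes a non-issue.
rnn = nn.GRU(input_size=32, hidden_size=64, batch_first=True, bidirectional=True)

output, hidden = rnn(embedding(reversed_ids))
print(output.shape)  # torch.Size([2, 7, 128]): 64 hidden units per direction
```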

A CNN is even less of a concern since it does not go over a sequence step by step, so there’s no concept of left to right or right to left. For a CNN it shouldn’t even matter where the padding is – before or after a sentence, if the sentence is shorter than the max input of the CNN.
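A toy illustration of the padding point (vocabulary size and dimensions are made up): the convolution produces the same output shape whether the pad tokens sit before or after the sentence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

max_len = 10
embed = nn.Embedding(1000, 32, padding_idx=0)
conv = nn.Conv1d(in_channels=32, out_channels=16, kernel_size=3)

sentence = torch.tensor([[5, 42, 7, 99]])           # 4 real tokens
pad = max_len - sentence.size(1)

padded_after = F.pad(sentence, (0, pad), value=0)   # tokens ... PAD PAD
padded_before = F.pad(sentence, (pad, 0), value=0)  # PAD PAD ... tokens

for batch in (padded_after, padded_before):
    x = embed(batch).transpose(1, 2)                # (batch, channels, seq_len)
    print(conv(x).shape)                            # torch.Size([1, 16, 8]) both times
```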

In short, if you prepare your data properly, you can use any existing network model that works for English. Some minor considerations like padding might come into play, but that’s not something to worry about when getting started.


You are welcome,

@vdw’s answer is complete, thanks to him. Actually, this is the reason you might not have noticed any right-to-left consideration in the article I linked.

@vdw and @Nikronic thank you :slight_smile:

My point is that in some tasks, like Grammatical Error Correction (GEC), I may need to process text sequentially both right-to-left and left-to-right. To extract features and relationships between words, it is better to do both RTL and LTR processing: some types of errors are easier to correct with a right-to-left seq2seq model, while others are more likely to be corrected by a left-to-right seq2seq model.

For example, if I have this sentence:
"She likes playing in park and come here every week"

So, when I start from right-to-left, I can easily correct an error like:
"She likes playing in the park and come here every week"

When I reverse the process and go left-to-right, I will correctly fix an error like:
“She likes playing in the park and comes here every week”

From this point, I need to apply both approaches to get better results, and this is my question. Please check the link below:

I still don’t know what is stopping you. Assume you have a sentence “A B C D E F” (no matter the language); you can give the model either “A B C D E F” or “F E D C B A”. Or you have two networks, one for each order. Or one bidirectional network.
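A minimal sketch of the “two networks, one per order” option; the tiny models here are just stand-ins for whatever seq2seq architecture you actually use:

```python
import torch
import torch.nn as nn

sentence = ["A", "B", "C", "D", "E", "F"]
ltr_tokens = sentence          # left-to-right order: A B C D E F
rtl_tokens = sentence[::-1]    # right-to-left order: F E D C B A

vocab = {tok: i for i, tok in enumerate(sentence)}

def to_ids(tokens):
    return torch.tensor([[vocab[t] for t in tokens]])

def make_model():
    # Stand-in for your real seq2seq model: the same architecture both times.
    return nn.Sequential(nn.Embedding(len(vocab), 16),
                         nn.GRU(16, 32, batch_first=True))

ltr_model, rtl_model = make_model(), make_model()  # one network per order

ltr_out, _ = ltr_model(to_ids(ltr_tokens))
rtl_out, _ = rtl_model(to_ids(rtl_tokens))
print(ltr_out.shape, rtl_out.shape)  # both torch.Size([1, 6, 32])
```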

I’m not saying that the task of GEC is trivial. I’m just saying there’s no difference between English and Arabic due to the reading direction.


Dear Mr. @vdw
thank you for your time and your cooperation.

I got your point in the previous comment: “there’s no difference between English and Arabic due to the reading direction”.
Actually, this is my fault. I didn’t explain my question in a clear and direct way, and I’m sorry about that.

Recently, I tried to develop my first GEC model based on a CNN. Everything worked well, and I got promising results. Now, I’m going to improve the results based on the idea from the paper that I posted in the previous comment. The next paragraph is copied from the paper.

“Based on the idea of multi-round correction, we further propose an advanced fluency boost inference approach: round-way error correction. Instead of progressively correcting a sentence with the same seq2seq model as introduced, round-way correction corrects a sentence through a right-to-left seq2seq model and a left-to-right seq2seq model successively.”

Accordingly, I’m looking for a way to apply a right-to-left seq2seq model and a left-to-right seq2seq model, just to improve the results; it is not a language matter.

This stage is just for applying several approaches I have read in some papers.

I hope now I explained my question clearly.

Again, I’m so sorry for the misunderstanding and for taking up your time.


I finally read the paper. From what I understand, they simply train two Seq2Seq models side by side: one processing sentences left to right, the other from right to left.

As far as I can tell, the training of both models is done in parallel, just with opposite sequence orders. For inference, they run an incorrect input sentence alternately through both models until the output is “good enough” or doesn’t change.

In short, you just have to train two models – well, one model in terms of its architecture but with two different datasets, where the difference is solely in the order of the tokens within each sentence.
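For the inference loop, something like this rough sketch, assuming you already have the two trained models wrapped in correction functions (correct_rtl and correct_ltr are placeholder names, not from the paper):

```python
def round_way_correct(tokens, correct_rtl, correct_ltr, max_rounds=4):
    """Alternately run a sentence through the right-to-left and
    left-to-right correction models until the output stops changing."""
    current = list(tokens)
    for _ in range(max_rounds):
        # The right-to-left model sees (and returns) reversed token order.
        rtl_fixed = list(reversed(correct_rtl(list(reversed(current)))))
        ltr_fixed = correct_ltr(rtl_fixed)
        if ltr_fixed == current:   # no further change: stop
            return ltr_fixed
        current = ltr_fixed
    return current
```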


I had this problem in Adobe Premiere, but I solved it.


Okay, thank you so much :slight_smile: