How do I choose the correct n-gram order for my BLEU-score calculation? Clearly, the translations below are very similar, but if I choose BLEU-4 (the default), the score is close to 0, which isn't very representative. I could just choose BLEU-1, since it's the highest. But is that like cheating? I don't know which one is most appropriate.
reference: [[['i', 'want', 'some', 'ice', 'cream', '.']]]
hypothesis: [['i', 'want', 'an', 'ice', 'cream', '.']]
In your case (judging by the example you've shown), you are using BLEU to evaluate individual sentences.

In its original form, the BLEU formula takes the geometric mean of the modified n-gram precisions for n = 1, 2, 3, 4 (multiplied by a brevity penalty). Since the two sentences you show have no 4-grams in common, the 4-gram precision is 0, which is why BLEU-4 comes out close to 0.
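You can see this directly by computing the per-order precisions for your sentence pair. Here is a minimal pure-Python sketch of the modified (clipped) n-gram precision, not NLTK's actual implementation:

```python
from collections import Counter
from math import exp, log

def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision: each hypothesis n-gram counts at most
    as often as it appears in the reference."""
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    hyp_ngrams = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    matched = sum(min(count, ref_ngrams[ng]) for ng, count in hyp_ngrams.items())
    return matched, sum(hyp_ngrams.values())

reference = ['i', 'want', 'some', 'ice', 'cream', '.']
hypothesis = ['i', 'want', 'an', 'ice', 'cream', '.']

precisions = []
for n in range(1, 5):
    matched, total = modified_precision(reference, hypothesis, n)
    precisions.append(matched / total)
# precisions: p_1 = 5/6, p_2 = 3/5, p_3 = 1/4, p_4 = 0/3

# BLEU-4 is the geometric mean of p_1..p_4 (times a brevity penalty,
# which is 1 here since the lengths match); a single zero precision
# drives the whole geometric mean to zero.
bleu_4 = 0.0 if 0.0 in precisions else exp(sum(log(p) for p in precisions) / 4)
print(precisions, bleu_4)
```

So even though BLEU-1 is 5/6, the single missing 4-gram zeroes out BLEU-4.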
The reason BLEU-4 isn't very representative, as you noted, is that BLEU isn't designed to evaluate individual sentences. It is a corpus-level metric and is only meaningful when the statistics used to compute it are accumulated over an entire corpus.
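To make "accumulated over an entire corpus" concrete: corpus-level BLEU pools the clipped n-gram counts across all sentence pairs and only then computes the precisions; it does not average per-sentence scores. A toy sketch (the second sentence pair is invented purely for illustration):

```python
from collections import Counter
from math import exp, log

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# A toy "corpus" of (reference, hypothesis) pairs.
corpus = [
    (['i', 'want', 'some', 'ice', 'cream', '.'],
     ['i', 'want', 'an', 'ice', 'cream', '.']),
    (['the', 'cat', 'sat', 'on', 'the', 'mat', '.'],
     ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']),
]

precisions = []
for n in range(1, 5):
    matched = total = 0
    for ref, hyp in corpus:  # pool counts over the whole corpus
        ref_c, hyp_c = ngram_counts(ref, n), ngram_counts(hyp, n)
        matched += sum(min(c, ref_c[g]) for g, c in hyp_c.items())  # clipped matches
        total += sum(hyp_c.values())
    precisions.append(matched / total)

# The second pair contributes 4-gram matches, so the pooled p_4 is no
# longer zero and corpus-level BLEU-4 is well defined.
bleu_4 = exp(sum(log(p) for p in precisions) / 4)  # brevity penalty = 1 (equal lengths)
print(precisions, bleu_4)
```

Note that the first pair still has zero 4-gram matches on its own; pooling is what rescues the statistic.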
As you mention, BLEU uses n-grams, and these n-gram statistics make little sense (or are far less meaningful) when computed over a single sentence. Moreover, the corpus-level BLEU score does not factorize into per-sentence scores.
Apart from this, BLEU as a metric suffers from known problems even when used to evaluate large corpora, one example being its inability to distinguish content words from function words ("a", "the", etc.).
So, I'd say that along with BLEU, you should evaluate your model's translations using other metrics as well.
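If you do need a per-sentence number anyway, one common workaround is to smooth the zero n-gram counts (this is the idea behind Chen and Cherry's smoothing methods, which NLTK exposes as `SmoothingFunction`). A rough pure-Python sketch of the idea, with the `eps` value chosen arbitrarily here:

```python
from collections import Counter
from math import exp, log

def sentence_bleu_smoothed(reference, hypothesis, max_n=4, eps=0.1):
    """Sentence-level BLEU with simple additive smoothing of zero
    n-gram match counts, so one missing order doesn't zero the score."""
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_c = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        hyp_c = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
        matched = sum(min(c, ref_c[g]) for g, c in hyp_c.items())
        total = sum(hyp_c.values())
        if matched == 0:
            matched = eps  # smooth zero counts instead of zeroing the score
        log_precisions.append(log(matched / total))
    bp = min(1.0, exp(1 - len(reference) / len(hypothesis)))  # brevity penalty
    return bp * exp(sum(log_precisions) / max_n)

score = sentence_bleu_smoothed(['i', 'want', 'some', 'ice', 'cream', '.'],
                               ['i', 'want', 'an', 'ice', 'cream', '.'])
print(score)
```

This gives a nonzero score that still penalizes the missing higher-order matches, which is usually more informative for a single pair than either BLEU-1 or an unsmoothed BLEU-4.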
Hope this helps,