"BLEU: 1.3885571752318349 TER: 102.44788993634243" (wonky BLEU and TER scores)

Cian · April 25, 2024, 6:20am

If I’m not mistaken, BLEU and TER are both supposed to be a number between 0 and 1, with higher numbers better for BLEU and lower numbers better for TER.
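To check the range question: TER is conventionally defined as the number of word edits needed to turn the hypothesis into the reference, divided by the number of reference words. sacrebleu 2.x reports it, like BLEU, as a percentage, so nothing caps TER at 100. Below is a minimal sketch of a simplified TER, assuming whitespace tokenization and ignoring the shift edits that real TER also allows, so it only approximates sacrebleu's numbers. `word_edit_distance` and `simple_ter` are hypothetical helpers, not part of sacrebleu's API.

```python
def word_edit_distance(hyp_words, ref_words):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    # prev[j] = edits needed to turn the first i-1 hypothesis words
    # into the first j reference words
    prev = list(range(len(ref_words) + 1))
    for i, h in enumerate(hyp_words, start=1):
        curr = [i] + [0] * len(ref_words)
        for j, r in enumerate(ref_words, start=1):
            cost = 0 if h == r else 1
            curr[j] = min(
                prev[j] + 1,         # delete a hypothesis word
                curr[j - 1] + 1,     # insert a reference word
                prev[j - 1] + cost,  # keep (cost 0) or substitute (cost 1)
            )
        prev = curr
    return prev[-1]


def simple_ter(hyps, refs):
    """Corpus-level TER as a percentage, WITHOUT shift edits (simplification).

    Edits are summed over all sentence pairs, divided by the total number
    of reference words, then multiplied by 100.
    """
    total_edits = 0
    total_ref_words = 0
    for hyp, ref in zip(hyps, refs):
        ref_words = ref.split()
        total_edits += word_edit_distance(hyp.split(), ref_words)
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        return 0.0  # nothing to compare against
    return 100.0 * total_edits / total_ref_words


print(simple_ter(["le chat est assis"], ["le chat est assis"]))  # 0.0
print(simple_ter(["un deux trois quatre cinq six"], ["bonjour"]))  # 600.0
```

The second call needs 6 edits against a 1-word reference, which gives 600, so a score above 100 simply means the hypotheses needed more edits than the references have words.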

I’m not sure if the error is from this code itself, or if something got seriously messed up in the actual corpora, but here’s what I ran:

```python
import sacrebleu


def evaluate_translation(reference_path, hypothesis_path):
    # splitlines() avoids the spurious empty "sentence" that split('\n')
    # adds when a file ends with a trailing newline
    with open(reference_path, 'r', encoding='utf-8') as ref_file:
        refs = [ref_file.read().splitlines()]  # a list of reference streams (just one here)
    with open(hypothesis_path, 'r', encoding='utf-8') as hyp_file:
        hyps = hyp_file.read().splitlines()
    # sacrebleu takes the hypotheses first, then the list of reference streams
    bleu = sacrebleu.corpus_bleu(hyps, refs)
    ter = sacrebleu.corpus_ter(hyps, refs)
    print(f"BLEU: {bleu.score}")
    print(f"TER: {ter.score}")


evaluate_translation('/(PATH)/eng-fra-test.txt-filtered.fr.subword.desubword', '/(PATH)/eng-fra.fr.translated.desubword')
```

And it’s in the title, but here are the results:

BLEU: 1.3885571752318349
TER: 102.44788993634243

The above is for English to French. When I tried French to English before, it spat out a perfectly normal BLEU score of about 0.57, but still with a wonky TER of about 112. Because the BLEU score came out perfectly fine, I have no idea where the problem is coming from.
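Since both metrics pair each hypothesis line with the reference line at the same position, one thing worth ruling out on the corpus side is misalignment or leftover preprocessing. Below is a minimal sketch of such a check, assuming both files are plain text with one sentence per line. `check_alignment` and its `preview` parameter are hypothetical, and the subword-marker count assumes BPE-style `@@` or SentencePiece-style `▁` markers.

```python
from pathlib import Path


def check_alignment(reference_path, hypothesis_path, preview=3):
    """Report basic corpus problems that can wreck corpus-level BLEU/TER.

    Compares line counts, counts empty lines and lines that still contain
    subword markers, and prints the first few pairs side by side.
    """
    refs = Path(reference_path).read_text(encoding='utf-8').splitlines()
    hyps = Path(hypothesis_path).read_text(encoding='utf-8').splitlines()
    report = {
        'ref_lines': len(refs),
        'hyp_lines': len(hyps),
        'empty_refs': sum(1 for line in refs if not line.strip()),
        'empty_hyps': sum(1 for line in hyps if not line.strip()),
        # Assumption: '@@' (BPE) or '\u2581' (SentencePiece) marks subwords
        'hyps_with_subword_markers': sum(
            1 for line in hyps if '@@' in line or '\u2581' in line
        ),
    }
    for key, value in report.items():
        print(f"{key}: {value}")
    if len(refs) != len(hyps):
        print("WARNING: line counts differ, so sentences are not paired correctly")
    # Eyeball a few pairs: each should be a translation of the same sentence
    for ref, hyp in list(zip(refs, hyps))[:preview]:
        print(f"REF: {ref}\nHYP: {hyp}\n")
    return report
```

If the printed pairs are clearly different sentences, the scores are comparing unrelated lines, which would drag BLEU down and push TER up regardless of translation quality.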