Finetuning RoBERTa on custom entailment data

Hi there,

I want to use fairseq / RoBERTa for a sentence-pair classification task (entailment), similar to the demo code here: https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md

Use RoBERTa for sentence-pair classification tasks:

import torch

# Load RoBERTa already finetuned on MNLI (0: contradiction, 1: neutral, 2: entailment)
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is not very optimized.')
roberta.predict('mnli', tokens).argmax()  # 0: contradiction

Two questions:

  1. How do I fine-tune the RoBERTa model on my own dataset for entailment?
    I’ve seen and understood the IMDB example for a custom classification task, but I’m not sure how to pre-process the data for an entailment task, where each example is a text pair with a label, rather than a single text with a 0/1 label. (See the preprocessing sketch below.)

  2. I have a dataset which is a bit different from the MultiNLI dataset in the following ways:

  • My dataset uses a lot of technical terms (think “deoxyribonucleic acid” and the like). Will finetuning alone be enough to handle these words, or do I need a custom dictionary?
  • I only have “neutral” and “entailment” training data, and I don’t actually need “contradiction”. Will finetuning with just two classes work? (See the second sketch below.)
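
For reference, here’s roughly how I’m planning to pre-process the pairs, adapted from the IMDB example but with two text columns, the way preprocess_GLUE_tasks.sh lays out RTE. This is only a sketch; the TSV paths and column layout below are my own assumptions:

import csv

# Assumed input: TSV files with three columns per row:
#   premise \t hypothesis \t label   (label is "entailment" or "neutral")
for split in ['train', 'dev']:
    with open(f'{split}.tsv') as fin, \
         open(f'{split}.raw.input0', 'w') as f0, \
         open(f'{split}.raw.input1', 'w') as f1, \
         open(f'{split}.label', 'w') as flab:
        for premise, hypothesis, label in csv.reader(fin, delimiter='\t'):
            f0.write(premise + '\n')     # sentence 1, one per line
            f1.write(hypothesis + '\n')  # sentence 2, one per line
            flab.write(label + '\n')     # string label, one per line

I’d then BPE-encode input0 and input1 separately (examples/roberta/multiprocessing_bpe_encoder.py) and run fairseq-preprocess once per input, plus once for the labels with --only-source, so I end up with input0/, input1/ and label/ directories like the RTE-bin that preprocess_GLUE_tasks.sh produces. Does that sound right?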
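
And for the two-label setup, I assume I can just register a classification head with num_classes=2 instead of 3 and finetune that, along the lines of the snippet below (again only a sketch; the head name and paths are placeholders I made up):

from fairseq.models.roberta import RobertaModel

# Start from the pre-trained model, not the MNLI-finetuned one, since the
# MNLI head is hard-wired to 3 classes. Paths are placeholders.
roberta = RobertaModel.from_pretrained('/path/to/roberta.large', checkpoint_file='model.pt')

# Register a new (randomly initialized) head with two output classes only:
# "entailment" vs "neutral".
roberta.register_classification_head('entailment_head', num_classes=2)

# The byte-level BPE splits rare technical terms into subwords instead of
# producing unknowns, so "deoxyribonucleic" should still encode fine:
tokens = roberta.encode('DNA is deoxyribonucleic acid.', 'DNA is a nucleic acid.')
log_probs = roberta.predict('entailment_head', tokens)  # shape (1, 2); meaningless until finetuned

Is that a sensible way to go about it, or should I stick with the fairseq-train / --task sentence_prediction route from the GLUE README?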

I’d highly appreciate any help or hints.
