Fine-tuning RoBERTa on custom entailment data

Hi there,

I want to use fairseq / RoBERTa for a sentence-pair classification task (entailment), similar to the demo code here:

Use RoBERTa for sentence-pair classification tasks:
import torch

roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is not very optimized.')
roberta.predict('mnli', tokens).argmax()  # 0: contradiction

Two questions:

  1. How do I fine-tune the RoBERTa model on my own dataset for entailment?
    I’ve seen and understood the IMDB example for a custom classification task, but I’m not sure how to pre-process the data for an entailment task on a text pair, rather than a 0/1 classification task on a single text.

  2. I have a dataset that differs from the MultiNLI dataset in the following ways:

  • My dataset uses a lot of technical terms (think “deoxyribonucleic acid” and similar words). Will fine-tuning be enough to handle these words / my custom vocabulary?
  • I only have “neutral” and “entailment” training data, and I don’t really need “contradiction” anyway. Will this still work?
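For context on question 1, the GLUE fine-tuning flow in the fairseq examples expects three parallel raw-text files per split — input0 (first sentence), input1 (second sentence), and label — which are then BPE-encoded and binarized with fairseq-preprocess. A minimal sketch of writing pairs into that layout (the file names follow fairseq's GLUE preprocessing script; the example data here is made up):

```python
import os

def write_split(split, examples, out_dir="processed"):
    """Write (premise, hypothesis, label) triples into the three parallel
    raw-text files used by fairseq's GLUE-style fine-tuning scripts.
    One example per line; the same line number links the three files."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, f"{split}.input0"), "w") as f0, \
         open(os.path.join(out_dir, f"{split}.input1"), "w") as f1, \
         open(os.path.join(out_dir, f"{split}.label"), "w") as fl:
        for premise, hypothesis, label in examples:
            f0.write(premise.replace("\n", " ") + "\n")
            f1.write(hypothesis.replace("\n", " ") + "\n")
            fl.write(label + "\n")

# Toy two-label entailment data (made up):
train = [
    ("DNA stores genetic information.",
     "Deoxyribonucleic acid carries hereditary data.", "entailment"),
    ("The sample was heated to 95 C.",
     "The experiment ran for two hours.", "neutral"),
]
write_split("train", train)
```

After that you would BPE-encode each input file and binarize with fairseq-preprocess, mirroring what preprocess_GLUE_tasks.sh does for RTE/MNLI.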
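On the two-label question: since the classification head is trained from scratch during fine-tuning, nothing forces you to keep MNLI's three classes — you would fine-tune with --num-classes 2. The thing to watch is that prediction indices then map to your own label set, not MNLI's. A tiny pure-Python sketch of the decoding step (the label order here is an assumption; in fairseq it follows the label dictionary built during preprocessing):

```python
import math

LABELS = ("entailment", "neutral")  # assumed order from your label dictionary

def decode(logits, labels=LABELS):
    """Turn the 2-way classifier logits into (label, confidence)
    via a numerically stable softmax + argmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]

print(decode([2.0, -1.0]))  # high-confidence 'entailment' on these toy logits
```

The same idea applies to the real model's output: take argmax over the 2-class logits and look the index up in your own label list.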

I’d highly appreciate any help or hints.
