Hi there,
I want to use fairseq / RoBERTa for a sentence-pair classification task (entailment), similar to the demo code in the README: https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md
Use RoBERTa for sentence-pair classification tasks:
import torch
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
roberta.eval()  # disable dropout for evaluation
tokens = roberta.encode('Roberta is a heavily optimized version of BERT.', 'Roberta is not very optimized.')
roberta.predict('mnli', tokens).argmax()  # 0: contradiction
Two questions:
- How do I fine-tune the RoBERTa model on my own dataset for entailment? I’ve seen and understood the IMDB example for a custom classification task, but that is 1/0 classification on a single text. I’m not sure how I have to pre-process the data for an entailment task, where each example is a text pair.
- My dataset differs from the MultiNLI dataset in the following ways:
  - It uses a lot of technical terms (think “deoxyribonucleic acid” and words like that). Will fine-tuning be enough for the model to handle these words, i.e. my custom vocabulary?
  - I only have “neutral” and “entailment” training data, and I don’t really need “contradiction”. Will fine-tuning with only these two labels work?
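To make the first question concrete, this is roughly the pre-processing I imagine for text pairs: a minimal sketch that splits a tab-separated file of (premise, hypothesis, label) rows into three line-aligned files. The function name `split_pairs`, the three-column TSV layout, and the output file names (modeled on what I understand the GLUE pre-processing script in the fairseq repo writes) are all my own assumptions, so please correct me if fairseq expects something different.

```python
import csv
from pathlib import Path


def split_pairs(tsv_path, out_dir, split="train"):
    """Split a (premise, hypothesis, label) TSV into three line-aligned files.

    Output names are an assumption: <split>.raw.input0, <split>.raw.input1
    and <split>.label. Returns the number of examples written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(tsv_path, newline="", encoding="utf-8") as src, \
            open(out_dir / f"{split}.raw.input0", "w", encoding="utf-8") as f0, \
            open(out_dir / f"{split}.raw.input1", "w", encoding="utf-8") as f1, \
            open(out_dir / f"{split}.label", "w", encoding="utf-8") as fl:
        # QUOTE_NONE keeps quotation marks inside sentences untouched.
        for row in csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE):
            if not row:
                continue  # skip blank lines
            premise, hypothesis, label = row
            f0.write(premise.strip() + "\n")
            f1.write(hypothesis.strip() + "\n")
            fl.write(label.strip() + "\n")
            count += 1
    return count


# Hypothetical usage: split_pairs("my_pairs.tsv", "processed", split="train")
```

My guess is that the two input files would then be BPE-encoded and binarized separately, the same way the single input is in the IMDB example, but I have not verified this.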
I’d greatly appreciate any help or hints.