Kaggle TPU RoBERTa Finetuning

I have pretrained roberta-base on DNA promoter sequences of plants (as part of a project). I am now trying to finetune it on a downstream task: predicting gene expression values, i.e. a list of 8 values (corresponding to various tissues) from a single promoter sequence.
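
For context, a minimal sketch of how I attach the 8-value regression head (the checkpoint path is a placeholder, and this relies on transformers' `problem_type="regression"` so the loss is MSE over the 8 outputs; it is not necessarily exactly what finetune.py does):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "path/to/pretrained-roberta"  # placeholder for my actual checkpoint dir
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=8,               # one regression target per tissue
    problem_type="regression",  # train with MSE loss over the 8 values
)
```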

This wasn’t possible on Kaggle’s GPU (due to memory restrictions), so I tried to do the same on a TPU using pytorch-xla (which seemed like the best option). The links to the notebook and the datasets used are as follows:

  1. Main Kaggle Notebook
  2. Dataset containing code and data

Version 43 is the one using the pytorch-xla code (as far as I could figure out). The data’s format is as follows:

sequence \t labels
dna_promoter_seq_here list_of_8_values_here

eg: CTCAAGCTGAGCAGTGGGTTTGCTCTGGAGGGGAAGCTCAACGGTGGCGACAAGGAAGAATCTGCTTGCGAGGCGAGCCCTGACGCCGCTGATAGCGACCAAAGGTGGATTAAACAACCCATTTCATCATTCTTCTTCCTTGTTAGTTATGATTCCCACGCTTGCCTTTCATGAATCATGATCCTATATGTATATTGATATTAATCAGTTCTAGAAAGTTCAACAACATTTGAGCATGTCAAAACCTGATCGTTGCCTGTTCCATGTCAACAGTGGATTATAACACGTGCAAATGTAGCTATTTGTGTGAGAAGACGTGTGATCGACTCTTTTTTTATATAGATAGCATTGAGATCAACTGTTTGTATATATCTTGTCATAACATTTTTACTTCGTAGCAACGTACGAGCGTTCACCTATTTGTATATAAGTTATCATGATATTTATAAGTTACCGTTGCAACGCACGGACACTCACCTAGTATAGTTTATGTATTACAGTACTAGGAGCCCTAGGCTTCCAATAACTAGAAAAAGTCCTGGTCAGTCGAACCAAACCACAATCCGACGTATACATTCTGGTTCCCCCACGCCCCCATCCGTTCGATTCA [54.679647, 60.646678, 54.9113, 78.878474, 21.326259, 27.973276, 17.419968, 40.465529]
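
A rough sketch of how data in this format can be loaded (file names are placeholders, and the label parsing is my assumption of what is needed since the labels column arrives as a stringified list):

```python
import ast
from datasets import load_dataset

# placeholder file names; the real splits live in the Kaggle dataset
data_files = {"train": "train.tsv", "test": "test.tsv", "eval": "eval.tsv"}
raw = load_dataset("csv", data_files=data_files, delimiter="\t")

def parse_labels(example):
    # the labels column is a string like "[54.679647, 60.646678, ...]"
    example["labels"] = [float(x) for x in ast.literal_eval(example["labels"])]
    return example

raw = raw.map(parse_labels)
```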

There are 722,000 examples of this kind, ~722 MB in total, split into ~400 MB train, ~200 MB test, and ~100 MB eval. When running "finetune.py", everything goes well until training starts (datasets are loaded, processed, etc.). But the latest run took 3+ hours to get to the next step, and the RAM usage kept increasing. It looked like the TPU run was very slow, and it then crashed after running out of memory. I have tried Accelerate and the Trainer, but those efforts were in vain.
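
Is a per-core loop along these lines what I should be aiming for? (A sketch only; the loader and batch contents are placeholders. My understanding is that `MpDeviceLoader` handles device placement and `mark_step`, that padding every sequence to a fixed length avoids XLA recompilation, and that pulling the loss to the host every step stalls the TPU.)

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

def train_one_epoch(model, loader, optimizer, device):
    model.train()
    # MpDeviceLoader moves each batch to this core and inserts mark_step between steps
    para_loader = pl.MpDeviceLoader(loader, device)
    for step, batch in enumerate(para_loader):
        optimizer.zero_grad()
        outputs = model(**batch)       # batch holds fixed-length input_ids/attention_mask/labels
        outputs.loss.backward()
        xm.optimizer_step(optimizer)   # gradient all-reduce + optimizer step
        if step % 100 == 0:
            # only fetch the loss to the host occasionally
            xm.master_print(f"step {step} loss {outputs.loss.item():.4f}")
```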

A few questions:

  1. Is my approach correct?
  2. What changes should I make?
  3. Can I run this code using the HuggingFace Trainer (which was originally used in the code)? If so, how? (See the sketch after this list.)
  4. Is the RAM usage normal?
  5. Should it take this long?
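
For question 3, would something along these lines be the right way to drive the Trainer under pytorch-xla? (A sketch only; `build_model()`, `build_datasets()`, and the hyperparameters are placeholders, not code from my notebook.)

```python
import torch_xla.distributed.xla_multiprocessing as xmp
from transformers import Trainer, TrainingArguments

def _mp_fn(index):
    # build everything inside the spawned process so nothing is tied to a device beforehand
    model = build_model()                 # placeholder
    train_ds, eval_ds = build_datasets()  # placeholder
    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=8,
        num_train_epochs=3,
        tpu_num_cores=8,
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()

if __name__ == "__main__":
    xmp.spawn(_mp_fn, nprocs=8, start_method="fork")
```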

If I pass the model as an argument to xmp.spawn, I end up seeing either “Check failed: data()->tensor_data” or “RuntimeError: Function AddcmulBackward0 returned an invalid gradient at index 1 - expected device xla:1 but got xla:0”. Why?
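
The only workaround I can think of is to keep the model on CPU in the parent process and only move it to the XLA device inside each worker, e.g. via xmp.MpModelWrapper. Is that the intended pattern? A rough sketch of what I mean (`build_model()` is a placeholder):

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

# build the model once on CPU in the parent; MpModelWrapper keeps the weights in shared memory
WRAPPED_MODEL = xmp.MpModelWrapper(build_model())  # build_model() is a placeholder

def _mp_fn(index):
    device = xm.xla_device()
    model = WRAPPED_MODEL.to(device)  # each worker moves the weights to its own core
    # ... training loop for this core goes here ...

xmp.spawn(_mp_fn, nprocs=8, start_method="fork")
```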

Kindly guide.