I was using this notebook: https://www.kaggle.com/theoviel/bert-pytorch-huggingface-with-tpu-multiprocessing
to fine-tune Hugging Face's XLM-RoBERTa base model on Jigsaw Multilingual (an ongoing Kaggle competition).
This is my first time with torch_xla and TPU multiprocessing!
The code I am running is exactly this one: https://pastebin.com/fS94MKYc. I run it in a Kaggle kernel, which provides a TPU v3-8.
But even with batch_size = 8, my Jupyter notebook crashes after showing this error message: "Your notebook tried to allocate more memory than is available. It has restarted."
Meanwhile, I can see other people using the same model with batch_size = 64.
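Since the message is about the notebook's memory and not about the batch, here is a rough back-of-the-envelope check I tried. It is only a sketch under assumptions: the parameter count is an approximate figure for xlm-roberta-base (not measured here), and it assumes each of the 8 spawned processes builds its own float32 copy of the model in host RAM before moving it to its TPU core.

```python
# Assumed, approximate numbers; none of these were measured in this notebook.
N_PARAMS = 270_000_000   # rough parameter count of xlm-roberta-base (assumption)
BYTES_PER_PARAM = 4      # float32 weights
N_PROCESSES = 8          # one process per TPU core on a v3-8


def host_weight_memory_gib(n_params, bytes_per_param, n_processes):
    """Host RAM (GiB) taken by the weights if every process keeps its own copy."""
    return n_params * bytes_per_param * n_processes / 2**30


total = host_weight_memory_gib(N_PARAMS, BYTES_PER_PARAM, N_PROCESSES)
print(f"weights alone, all processes: {total:.1f} GiB")
```

If that estimate is anywhere near right, the weights alone take several GiB of host RAM before any data is loaded, and that cost does not depend on batch_size at all.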
The full error message looks like this:
Exception                                 Traceback (most recent call last)
in
/opt/conda/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    180         join=join,
    181         daemon=daemon,
--> 182         start_method=start_method)

/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    156
    157     # Loop on join until it returns True or raises an exception.
--> 158     while not context.join():
    159         pass
    160

/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    106                 raise Exception(
    107                     "process %d terminated with signal %s" %
--> 108                     (error_index, name)
    109                 )
    110             else:

Exception: process 0 terminated with signal SIGKILL
So I don't understand exactly where in my code I need to make a change so that it works. It seems the problem is not the batch size but something else that I am unable to catch. Please help, thanks in advance.
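To narrow down where the memory goes, one thing I can do is log the peak host RAM of each process at a few points (after loading the data, after building the model, after the first step). This is a minimal sketch using only the standard library; the helper names `peak_rss_mb` and `log_host_memory` are my own and not part of torch_xla.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_host_memory(tag, index=0):
    """Print and return a one-line report of this process's peak host RAM."""
    msg = f"[process {index}] {tag}: peak host RSS = {peak_rss_mb():.0f} MiB"
    print(msg, flush=True)
    return msg
```

The idea would be to call something like `log_host_memory("after building model", index)` inside the function passed to `xmp.spawn`, using the process index that function receives as its first argument, and compare the numbers across the 8 processes.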