Exception: process 0 terminated with signal SIGKILL

I was using this notebook: https://www.kaggle.com/theoviel/bert-pytorch-huggingface-with-tpu-multiprocessing

to fine-tune Hugging Face's XLM-RoBERTa base model on Jigsaw Multilingual (an ongoing Kaggle competition).

This is my first time with torch_xla and TPU multiprocessing!

The code I am running is exactly this one: https://pastebin.com/fS94MKYc, on a Kaggle kernel, which provides a TPU v3-8.

But even with batch_size = 8, my Jupyter notebook crashes with this error message: "Your notebook tried to allocate more memory than is available. It has restarted."
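Since that kernel message points at host RAM rather than TPU memory, one thing I can do is watch how much memory is actually free while the processes spawn. A minimal Linux-only sketch (the helper name is mine, not from the notebook):

```python
def available_ram_gb(path="/proc/meminfo"):
    """Return MemAvailable from /proc/meminfo in GiB (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                kb = int(line.split()[1])  # value is reported in kB
                return kb / (1024 ** 2)
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

print(f"{available_ram_gb():.1f} GiB available")
```

Calling this before and after model/dataset loading shows whether host RAM is being exhausted before training even starts.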

Meanwhile, I can see other people using the same model with batch_size = 64.

The full traceback looks like this:

Exception                                 Traceback (most recent call last)
in

/opt/conda/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    180         join=join,
    181         daemon=daemon,
--> 182         start_method=start_method)

/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    156
    157     # Loop on join until it returns True or raises an exception.
--> 158     while not context.join():
    159         pass
    160

/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    106             raise Exception(
    107                 "process %d terminated with signal %s" %
--> 108                 (error_index, name)
    109             )
    110         else:

Exception: process 0 terminated with signal SIGKILL

So I don't understand exactly where in my code I need to make a change so that it works. It seems like the problem is not the batch size but something else that I am unable to catch. Please help, thanks in advance.
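From what I understand, SIGKILL here usually means the Linux OOM killer terminated worker process 0, not a bug raised by my own code. A minimal, torch-free sketch (fork start method, Linux only) of how a SIGKILL'd child surfaces as a negative exit code, which torch's spawn loop then reports as exactly this exception:

```python
import multiprocessing as mp
import os
import signal

def worker():
    # Stand-in for the OOM killer: the kernel delivers SIGKILL to this process.
    os.kill(os.getpid(), signal.SIGKILL)

def run_worker():
    """Start the worker in a child process and return its exit code."""
    ctx = mp.get_context("fork")  # fork avoids re-importing the main module
    p = ctx.Process(target=worker)
    p.start()
    p.join()
    return p.exitcode

if __name__ == "__main__":
    # A child killed by SIGKILL exits with -SIGKILL (-9); torch's spawn sees
    # the negative exit code and raises "process 0 terminated with signal SIGKILL".
    print(run_worker())  # -9
```

So the traceback only shows where the parent noticed the death; the real cause is whatever made the child's memory use spike.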

Hi @mobassir94, you put the category "Ignite", but this doesn't seem to be related to PyTorch-Ignite, right?

@vfdev-5 Sorry, I didn't find an XLA category there.

No problem, maybe we can ask @smth about that.


@mobassir94 Can you check if this works?

Did you ever figure out (a) what was causing your issue and (b) how you solved it?

Thanks in advance!

@Brando_Miranda Yes, it was an OOM issue; I reduced the batch size. Recent versions of torch_xla are much better.
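For reference, one assumption worth stating (I haven't re-checked the notebook line by line): with xmp.spawn on a TPU v3-8, eight processes run in parallel and the batch_size handed to each DataLoader is per core, so the effective global batch size is eight times larger. A quick sketch of the arithmetic (variable names are mine):

```python
# Assumed setup, not taken verbatim from the notebook: xmp.spawn launches
# one process per TPU core, each with its own DataLoader.
num_tpu_cores = 8          # a TPU v3-8 exposes 8 cores
per_core_batch_size = 8    # the batch size that crashed for me

# Each step processes per_core_batch_size samples on every core, so the
# effective global batch size is the product of the two.
effective_batch_size = per_core_batch_size * num_tpu_cores
print(effective_batch_size)  # 64
```

That may explain why a "batch_size = 64" report elsewhere isn't directly comparable to a per-core setting.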