Exception: process 0 terminated with signal SIGKILL

I was using this notebook: https://www.kaggle.com/theoviel/bert-pytorch-huggingface-with-tpu-multiprocessing

to fine-tune Hugging Face's XLM-RoBERTa base model on Jigsaw Multilingual (an ongoing Kaggle competition).

This is my first time with torch_xla and TPU multiprocessing!

The code I am trying is exactly this one: https://pastebin.com/fS94MKYc, run on a Kaggle kernel, which provides a TPU v3-8.
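
For context, the core launch pattern in that code looks roughly like the sketch below. This is a simplification rather than the exact pastebin code, and the _mp_fn and FLAGS names are just placeholders:

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(rank, flags):
    # Each of the 8 spawned processes drives one TPU core.
    device = xm.xla_device()
    # ... build the dataloader, model and optimizer here, then train ...

FLAGS = {}
# 'fork' is the start method commonly used in Kaggle notebooks.
xmp.spawn(_mp_fn, args=(FLAGS,), nprocs=8, start_method='fork')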

But even with batch_size = 8, my Jupyter notebook crashes with this error message: "Your notebook tried to allocate more memory than is available. It has restarted."

Meanwhile, I can see other people using the same model with batch_size = 64.

The full error message looks like this:


Exception                                 Traceback (most recent call last)
<ipython-input> in <module>()

/opt/conda/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    180             join=join,
    181             daemon=daemon,
--> 182             start_method=start_method)

/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    156
    157     # Loop on join until it returns True or raises an exception.
--> 158     while not context.join():
    159         pass
    160

/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    106                 raise Exception(
    107                     "process %d terminated with signal %s" %
--> 108                     (error_index, name)
    109                 )
    110             else:

Exception: process 0 terminated with signal SIGKILL

So I don't understand exactly where in my code I need to make a change so that it can work. It seems like the problem is not the batch size but something else that I am unable to catch. Please help, thanks in advance.

Hi @mobassir94, you filed this under the "Ignite" category, but it does not seem to be related to PyTorch-Ignite, right?

@vfdev-5 sorry, I didn't find an XLA category there.

No problem, maybe we can ask @smth about that.

@mobassir94 can you check if this works?

Did you ever figure out (a) what was causing your issue and (b) how you solved it?

Thanks in advance!

@Brando_Miranda yes, it was an OOM issue; I reduced the batch size. Recent versions of torch_xla are much better.
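
For anyone hitting the same SIGKILL: it happens when the host runs out of RAM and the kernel kills a worker, typically because each of the 8 spawned processes loads its own full copy of the model. Besides lowering the batch size, a minimal sketch of one way to cut host-memory use is below; it relies on torch_xla's xmp.MpModelWrapper helper (which keeps a single shared copy of the weights instead of one per worker), and the _mp_fn name is again just a placeholder:

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
from transformers import XLMRobertaModel

# Built once in the parent process; with start_method='fork' the workers
# share this single copy instead of each loading its own into host RAM.
WRAPPED_MODEL = xmp.MpModelWrapper(
    XLMRobertaModel.from_pretrained('xlm-roberta-base'))

def _mp_fn(rank, flags):
    device = xm.xla_device()
    model = WRAPPED_MODEL.to(device)  # moves the shared weights onto this core
    # ... training loop with a reduced per-core batch size ...

xmp.spawn(_mp_fn, args=({},), nprocs=8, start_method='fork')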