Hi,
I am training my own model on a Google Colab TPU. I read this tutorial notebook and tried to train my model on multiple TPU cores, but I got an error.
It looks like a long issue, but I don't think it is too complicated. I've only recently started using PyTorch and I really don't know how to fix this. Please help me.
About my model:
```python
import torch.nn as nn
from transformers import XLMRobertaModel

class BERTModel(nn.Module):
    def __init__(self, ...):
        super().__init__()
        if ...:
            self.bert_model = XLMRobertaModel.from_pretrained(...)  # Hugging Face XLM-R
        elif ...:
            self.bert_model = others_model.from_pretrained(...)  # some other Hugging Face model
        ...  # some other model parameters

    def forward(self, ...):
        bert_input = ...
        output = self.bert_model(bert_input)
        ...  # some function that processes output

    def other_function(self, ...):
        # just does some processing on output, e.g. concatenates layers' embeddings, and returns
        ...

class MAINModel(nn.Module):
    def __init__(self, ...):
        super().__init__()
        print('Using model 1')
        self.bert_model_1 = BERTModel(...)
        print('Using model 2')
        self.bert_model_2 = BERTModel(...)
        self.linear = nn.Linear(...)

    def forward(self, ...):
        bert_input = ...
        bert_output = self.bert_model_1(bert_input)  # self.bert_model_2 is used similarly
        linear_output = self.linear(bert_output)
        return linear_output
```
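Note that both `print` calls are in `MAINModel.__init__()`, so constructing the model once should print both lines:

```python
model = MAINModel(...)
# expected console output:
#   Using model 1
#   Using model 2
```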
Then I copied `map_fn` from the tutorial notebook and changed it a bit:
- Please take a look at my function; I have commented the changes that I made: https://ideone.com/6CvRmo
- I removed the dataset-download part of the original `map_fn` because I use my own `TensorDataset`s: `map_fn(index, flags, train_dataset, dev_dataset)`
- I have three optimizers, where the original `map_fn` has only one (see the sketch after this list):
  ```python
  # original: xm.optimizer_step(optimizer)
  # mine:
  xm.optimizer_step(optimizer_1)
  xm.optimizer_step(optimizer_2)
  xm.optimizer_step(optimizer_3)
  ```
- Everything else is the same.
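To be concrete, the core of my modified `map_fn` looks roughly like this (a simplified sketch following the tutorial's structure; the model arguments, loss computation, and optimizer definitions are elided, and `optimizer_1/2/3` are my own names, not the tutorial's):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

def map_fn(index, flags, train_dataset, dev_dataset):
    device = xm.xla_device()
    print('DEVICE:', device)

    model = MAINModel(...).to(device)
    optimizer_1, optimizer_2, optimizer_3 = ...  # my three optimizers

    # shard the dataset across the 8 TPU cores, as in the tutorial
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset,
        num_replicas=xm.xrt_world_size(),
        rank=xm.get_ordinal(),
        shuffle=True)
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=flags['batch_size'], sampler=train_sampler)

    for epoch in range(flags['num_epochs']):
        para_loader = pl.ParallelLoader(train_loader, [device]).per_device_loader(device)
        for batch in para_loader:
            output = model(batch)
            loss = ...  # my loss computed from output
            loss.backward()
            # original: xm.optimizer_step(optimizer)
            xm.optimizer_step(optimizer_1)
            xm.optimizer_step(optimizer_2)
            xm.optimizer_step(optimizer_3)
```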
Finally, I run:

```python
xmp.spawn(map_fn, args=(flags, train_dataset, dev_dataset), nprocs=8, start_method='fork')
```
And I got this output (I print the device right below the line `device = xm.xla_device()`):
```
DEVICE: xla:1
Using model 1
DEVICE: xla:0
Using model 1
DEVICE: xla:0
Using model 1
DEVICE: xla:0
Using model 1
DEVICE: xla:0
Using model 1
DEVICE: xla:0
Using model 1
DEVICE: xla:0
Using model 1
DEVICE: xla:0
Using model 1
```
And the error:
```
Exception                                 Traceback (most recent call last)
<ipython-input-77-e32013c52d88> in <module>()
----> 1 xmp.spawn(map_fn, args=(flags,train_dataset, dev_dataset,), nprocs=8, start_method='fork')

2 frames
/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    393         join=join,
    394         daemon=daemon,
--> 395         start_method=start_method)
    396
    397

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    155
    156     # Loop on join until it returns True or raises an exception.
--> 157     while not context.join():
    158         pass
    159

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    105                 raise Exception(
    106                     "process %d terminated with signal %s" %
--> 107                     (error_index, name)
    108                 )
    109             else:

Exception: process 0 terminated with signal SIGSEGV
```
I guess the problem is in my model classes (`BERTModel()`, `MAINModel()`), because the printed output is:

```
DEVICE: xla:0   # <----- most processes print xla:0, not xla:1..7
Using model 1   # <----- always "Using model 1", never "Using model 2"
```
But I tried feeding a single input batch to `MAINModel()` and it returned the output I expected.
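That test looked roughly like this (a simplified sketch; the batch construction is elided and `batch` is a placeholder name):

```python
model = MAINModel(...)  # prints "Using model 1" and "Using model 2"
batch = ...             # one tokenized input batch
output = model(batch)
print(output.shape)     # the shape I expect
```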
I'm really sorry for the long post, but I really don't know how to fix this. Please help me. Thank you so much.