Problem training a model on Google Colab TPU

Hi,

I am training my own model on a Google Colab TPU. I read this tutorial notebook and tried to train my model on multiple TPU cores, but I got an error.

It seems like a long issue, but I think it is not too complicated. I have only recently started using PyTorch and I really don’t know how to fix this. Please help me.

About my model:

class BERTModel(nn.Module):
    def __init__(self, ...):
        super().__init__()
        if ...:
            self.bert_model = XLMRobertaModel.from_pretrained(...)   # huggingface XLM-R
        elif ...:
            self.bert_model = others_model.from_pretrained(...)      # another huggingface model

        ...  # some other model parameters

    def forward(self, ...):
        bert_input = ...
        output = self.bert_model(bert_input)

        ...  # some function that processes output

    def other_function(self, ...):
        # just does some processing on output, e.g. concatenates layers' embeddings, and returns ...

class MAINModel(nn.Module):
    def __init__(self, ...):
        super().__init__()

        print('Using model 1')
        self.bert_model_1 = BERTModel(...)

        print('Using model 2')
        self.bert_model_2 = BERTModel(...)

        self.linear = nn.Linear(...)

    def forward(self, ...):
        bert_input = ...
        bert_output_1 = self.bert_model_1(bert_input)
        bert_output_2 = self.bert_model_2(bert_input)
        bert_output = ...  # combine the two BERT outputs
        linear_output = self.linear(bert_output)

        return linear_output

Then I copied def map_fn from this tutorial notebook and changed it a bit:

  • Please take a look at my function; I have commented the changes I made: https://ideone.com/6CvRmo

  • I removed the dataset-download part of the original map_fn because I use my own TensorDataset:

    def map_fn(index, flags, train_dataset, dev_dataset):

  • I have three optimizers, while the original map_fn has only one:

    # original: xm.optimizer_step(optimizer)
    # mine:
    xm.optimizer_step(optimizer_1)
    xm.optimizer_step(optimizer_2)
    xm.optimizer_step(optimizer_3)
    
    
  • Everything else is the same (a condensed sketch of my modified map_fn follows below).
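To make the changes easier to follow, here is a condensed sketch of how my modified map_fn is structured. It is not the exact code (that is in the ideone link above), and names such as batch_size, lr, num_epochs and the Adam optimizers are placeholders for my real settings:

    import torch
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.parallel_loader as pl

    def map_fn(index, flags, train_dataset, dev_dataset):
        # each spawned process gets its own XLA device
        device = xm.xla_device()
        print('DEIVCE: ', device)

        # build the model inside the process and move it to the TPU core
        model = MAINModel(...).to(device)

        # shard the dataset across the 8 processes, as in the tutorial
        train_sampler = torch.utils.data.distributed.DistributedSampler(
            train_dataset,
            num_replicas=xm.xrt_world_size(),
            rank=xm.get_ordinal(),
            shuffle=True)
        train_loader = torch.utils.data.DataLoader(
            train_dataset,
            batch_size=flags['batch_size'],
            sampler=train_sampler)

        # three optimizers instead of the tutorial's single one (placeholders)
        optimizer_1 = torch.optim.Adam(..., lr=flags['lr'])
        optimizer_2 = torch.optim.Adam(..., lr=flags['lr'])
        optimizer_3 = torch.optim.Adam(..., lr=flags['lr'])

        for epoch in range(flags['num_epochs']):
            para_loader = pl.ParallelLoader(train_loader, [device])
            for batch in para_loader.per_device_loader(device):
                ...  # zero_grad, forward pass, loss computation
                loss.backward()
                xm.optimizer_step(optimizer_1)
                xm.optimizer_step(optimizer_2)
                xm.optimizer_step(optimizer_3)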

Finally, I run: xmp.spawn(map_fn, args=(flags, train_dataset, dev_dataset,), nprocs=8, start_method='fork')
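For completeness, that call assumes the usual torch_xla multiprocessing import and a flags dict of hyperparameters, roughly like this (the values are placeholders for my real ones):

    import torch_xla.distributed.xla_multiprocessing as xmp

    flags = {'batch_size': ..., 'lr': ..., 'num_epochs': ...}  # my hyperparameters

    xmp.spawn(map_fn,
              args=(flags, train_dataset, dev_dataset,),
              nprocs=8,
              start_method='fork')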

And I got this output (I print the device right below the line device = xm.xla_device()):

DEIVCE:  xla:1
Using  model 1
DEIVCE:  xla:0
Using  model 1
DEIVCE:  xla:0
Using  model 1
DEIVCE:  xla:0
Using  model 1
DEIVCE:  xla:0
Using  model 1
DEIVCE:  xla:0
Using  model 1
DEIVCE:  xla:0
Using  model 1
DEIVCE:  xla:0
Using  model 1

And the error:

Exception                                 Traceback (most recent call last)
<ipython-input-77-e32013c52d88> in <module>()
----> 1 xmp.spawn(map_fn, args=(flags,train_dataset, dev_dataset,), nprocs=8, start_method='fork')

2 frames
/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py in spawn(fn, args, nprocs, join, daemon, start_method)
    393         join=join,
    394         daemon=daemon,
--> 395         start_method=start_method)
    396 
    397 

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    155 
    156     # Loop on join until it returns True or raises an exception.
--> 157     while not context.join():
    158         pass
    159 

/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    105                 raise Exception(
    106                     "process %d terminated with signal %s" %
--> 107                     (error_index, name)
    108                 )
    109             else:

Exception: process 0 terminated with signal SIGSEGV

I guess the problem is in my model classes (BERTModel(), MAINModel()), because the printed output is:

DEIVCE:  xla:0    # <----- most of the output is xla:0, not xla:1,2,3,4,5,6,7
Using  model 1    # <----- it always prints "Using model 1", never "Using model 2"

But when I tried feeding a single input batch to MAINModel(), it returned the output I expected.
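That sanity check was roughly the following (sample_batch is a placeholder for one real batch from my TensorDataset):

    model = MAINModel(...)
    model.eval()

    with torch.no_grad():
        sample_batch = ...            # one batch taken from train_dataset
        output = model(sample_batch)

    print(output.shape)               # looks as expected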

I’m really sorry for the long issue, but I really don’t know how to fix this. Please help me. Thank you so much.