I am trying to train a BERT-based model, but training gets stuck after 1 epoch and doesn’t proceed further. I am training BERT from scratch on my custom dataset, which contains around 13,900,000 training data points and around 4,500,000 each for testing and validation. Training works fine for smaller datasets, say around 1,000,000 data points, but with the full dataset it doesn’t move on even after 2 hours. There is no error; the next epoch simply never starts, and the cell keeps running for hours without making progress. I am running it on an Nvidia GeForce RTX 3090.
Hi, can you post a reduced code snippet that reproduces the error?
# Imports assumed from the names used below (not shown in the original post)
import torch
import torch.nn as nn
from torch.cuda.amp import autocast
from transformers import AutoModel


class SentencePairClassifier(nn.Module):

    def __init__(self, bert_model="allenai/scibert_scivocab_uncased", freeze_bert=False):
        super(SentencePairClassifier, self).__init__()
        # Instantiating BERT-based model object
        self.bert_layer = AutoModel.from_pretrained(bert_model)

        # Hidden size of the encoder output for each supported checkpoint
        if bert_model == "allenai/scibert_scivocab_uncased":  # ~110M parameters (BERT-base architecture)
            hidden_size = 768
        elif bert_model == "albert-large-v2":  # 18M parameters
            hidden_size = 1024
        elif bert_model == "albert-xlarge-v2":  # 60M parameters
            hidden_size = 2048
        elif bert_model == "albert-xxlarge-v2":  # 235M parameters
            hidden_size = 4096
        elif bert_model == "bert-base-uncased":  # 110M parameters
            hidden_size = 768

        # Freeze bert layers and only train the classification layer weights
        if freeze_bert:
            for p in self.bert_layer.parameters():
                p.requires_grad = False

        # Classification layer
        self.cls_layer = nn.Linear(hidden_size, 1)
        self.dropout = nn.Dropout(p=0.1)

    @autocast()  # run in mixed precision
    def forward(self, input_ids, attn_masks, token_type_ids):
        x = self.bert_layer(input_ids, attn_masks, token_type_ids)
        # The encoder returns several outputs; index 1 is the pooled [CLS] representation
        pooler_output = x[1]
        pooler_output_tanh = torch.tanh(pooler_output)
        pooler_output_atanh = torch.atanh(pooler_output_tanh)
        logits = self.cls_layer(self.dropout(pooler_output_atanh))
        return logits
As I mentioned, this works for smaller datasets, but with a dataset of my size it doesn’t proceed after the 1st epoch.
From what you’re describing, it’s probably something to do with the dataloader or the training regime and not the model.
I am not sure. How is it training for the first epoch, then?
Memory issues usually occur in the training regime, and sometimes in the dataloader, not in the model. If the GPU runs out of memory, the script crashes with a CUDA out-of-memory error. RAM, on the other hand, may just keep allocating virtual memory until your PC is rendered unusable.
Anyway, you’ve shown the model and the issue is not reproducible given that code.
What else is needed to reproduce the issue?
Thank you for the information about memory. Is there any way I can avoid it?
The training regime and the dataloader (if custom).
# Imports assumed from the names used below (not shown in the original post)
from torch.utils.data import Dataset
from transformers import AutoTokenizer


class CustomDataset(Dataset):
    """
    Tokenize each pair of sentence and their description to get
    token ids, attention masks and token type ids

    *Args:*
        `Dataset` dataset containing data about the papers.
    *Returns:*
        Returns the tokenized version of the dataset.
    """

    def __init__(self, data, maxlen, with_labels=True, bert_model='allenai/scibert_scivocab_uncased'):
        self.data = data  # pandas DataFrame with 'sentence', 'substring' and (optionally) 'score' columns
        self.tokenizer = AutoTokenizer.from_pretrained(bert_model)
        self.maxlen = maxlen
        self.with_labels = with_labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sent1 = str(self.data.loc[index, 'sentence'])
        sent2 = str(self.data.loc[index, 'substring'])

        # Tokenize the pair of sentences to get token ids, attention masks and token type ids
        encoded_pair = self.tokenizer(sent1, sent2,
                                      padding='max_length',  # Pad to max_length
                                      truncation=True,  # Truncate to max_length
                                      max_length=self.maxlen,
                                      return_tensors='pt')  # Return torch.Tensor objects

        token_ids = encoded_pair['input_ids'].squeeze(0)  # tensor of token ids
        attn_masks = encoded_pair['attention_mask'].squeeze(0)  # binary tensor with "0" for padded values and "1" for the other values
        token_type_ids = encoded_pair['token_type_ids'].squeeze(0)  # binary tensor with "0" for the 1st sentence tokens & "1" for the 2nd sentence tokens

        if self.with_labels:  # True if the dataset has labels
            label = self.data.loc[index, 'score']
            return token_ids, attn_masks, token_type_ids, label
        else:
            return token_ids, attn_masks, token_type_ids
If you’re trying to load up all of your data into RAM, that might be where your issue is coming from. Have you tried checking your RAM while the model is running to see if it’s reaching max capacity?
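One rough way to check this from inside the notebook is to log the process's peak memory at the end of each epoch. The sketch below uses only Python's standard `resource` module, so it assumes a Unix-like system; the helper name `peak_rss_mb` and the placeholder loop are made up for illustration and are not part of the code posted above.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


# Placeholder loop: in real code the training step for one epoch goes here.
for epoch in range(3):
    # ... train for one epoch ...
    print(f"epoch {epoch}: peak host RAM so far = {peak_rss_mb():.1f} MiB")
```

If the number climbs toward the machine's total RAM around the end of the first epoch, that points at host memory rather than the GPU. Note that this only measures the main process; memory held by separate dataloader worker processes would need to be checked with a system monitor.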
A few ideas you can try that don’t involve purchasing more hardware:
- Tokenize all of the text before training, then delete the original pandas DataFrame. This will use less memory.
- Load the data as you go.
- Use some other type of data holder than pandas. Pandas is not the most memory efficient.
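
As a minimal sketch of the "load the data as you go" idea, assuming the data lives in a CSV file with the `sentence`, `substring` and `score` columns used in the dataset class above: the helpers below stream rows from disk in small batches instead of holding the whole table in memory. The function names `iter_rows` and `iter_batches` are hypothetical, and the CSV storage format is an assumption, since the thread doesn't say how the data is stored.

```python
import csv
from itertools import islice
from typing import Dict, Iterator, List


def iter_rows(csv_path: str) -> Iterator[Dict[str, str]]:
    """Yield one row at a time as a dict; only the current row is held in memory."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield row


def iter_batches(csv_path: str, batch_size: int) -> Iterator[List[Dict[str, str]]]:
    """Group the streamed rows into lists of at most `batch_size` rows."""
    rows = iter_rows(csv_path)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:  # the file is exhausted
            return
        yield batch
```

Each batch could then be tokenized and moved to the GPU right before the forward pass, for example by looping over `iter_batches("train.csv", 32)` with a file name of your own. Within PyTorch, the same pattern would probably be wrapped in an `IterableDataset`; the trade-off is that shuffling is no longer free and has to be handled separately, for instance by shuffling the file on disk beforehand.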