I am trying to train a BERT-based model, but training seems to get stuck after 1 epoch and doesn’t proceed further. I am training BERT from scratch on my custom dataset, which contains around 13,900,000 training data points and around 4,500,000 each for testing and validation. Training works fine for smaller data sizes, say around 1,000,000 data points, but with the full dataset it doesn’t move on even after 2 hours. There is no error; the next epoch just never starts, and the cell keeps running for hours without progressing. I am running it on an Nvidia GeForce RTX 3090.
Hi, can you post a reduced code snippet that reproduces the error?
# Imports assumed by this snippet (not shown in the original post)
import torch
import torch.nn as nn
from torch.cuda.amp import autocast
from transformers import AutoModel

class SentencePairClassifier(nn.Module):
    def __init__(self, bert_model="allenai/scibert_scivocab_uncased", freeze_bert=False):
        super(SentencePairClassifier, self).__init__()
        # Instantiating BERT-based model object
        self.bert_layer = AutoModel.from_pretrained(bert_model)
        # Hidden size of the encoder output for each supported checkpoint
        if bert_model == "allenai/scibert_scivocab_uncased":  # ~110M parameters (BERT-base architecture)
            hidden_size = 768
        elif bert_model == "albert-large-v2":  # 18M parameters
            hidden_size = 1024
        elif bert_model == "albert-xlarge-v2":  # 60M parameters
            hidden_size = 2048
        elif bert_model == "albert-xxlarge-v2":  # 235M parameters
            hidden_size = 4096
        elif bert_model == "bert-base-uncased":  # 110M parameters
            hidden_size = 768
        else:
            # Without this branch, hidden_size would be undefined below
            raise ValueError(f"Unsupported bert_model: {bert_model}")
        # Freeze bert layers and only train the classification layer weights
        if freeze_bert:
            for p in self.bert_layer.parameters():
                p.requires_grad = False
        # Classification layer
        self.cls_layer = nn.Linear(hidden_size, 1)
        self.dropout = nn.Dropout(p=0.1)

    @autocast()  # run in mixed precision
    def forward(self, input_ids, attn_masks, token_type_ids):
        x = self.bert_layer(input_ids, attn_masks, token_type_ids)
        pooler_output = x[1]  # pooled [CLS] representation
        pooler_output_tanh = torch.tanh(pooler_output)
        pooler_output_atanh = torch.atanh(pooler_output_tanh)
        logits = self.cls_layer(self.dropout(pooler_output_atanh))
        return logits
As I mentioned, this works for smaller data sizes, but for a dataset of my size, it doesn’t proceed after the 1st epoch.
From what you’re describing, it’s probably something to do with the dataloader or the training regime and not the model.
I am not sure. How is it training for the first epoch, then?
Memory issues usually occur in the training regime, and sometimes in the dataloader, not in the model. If the GPU runs out of memory, the script will crash with a CUDA out-of-memory error. RAM, on the other hand, may just keep allocating virtual memory until your PC is rendered unusable.
Anyway, you’ve shown the model and the issue is not reproducible given that code.
What else is needed to reproduce the issue?
Thank you for the information about memory. Is there any way I can avoid it?
The training regime and the dataloader (if custom).
# Imports assumed by this snippet (not shown in the original post)
from torch.utils.data import Dataset
from transformers import AutoTokenizer

class CustomDataset(Dataset):
    """
    Tokenize each pair of sentences to get token ids, attention masks and token type ids
    *Args:*
        `Dataset`: dataset containing data about the papers.
    *Returns:*
        Returns the tokenized version of the dataset.
    """
    def __init__(self, data, maxlen, with_labels=True, bert_model='allenai/scibert_scivocab_uncased'):
        self.data = data
        self.tokenizer = AutoTokenizer.from_pretrained(bert_model)
        self.maxlen = maxlen
        self.with_labels = with_labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sent1 = str(self.data.loc[index, 'sentence'])
        sent2 = str(self.data.loc[index, 'substring'])
        # Tokenize the pair of sentences to get token ids, attention masks and token type ids
        encoded_pair = self.tokenizer(sent1, sent2,
                                      padding='max_length',  # Pad to max_length
                                      truncation=True,  # Truncate to max_length
                                      max_length=self.maxlen,
                                      return_tensors='pt')  # Return torch.Tensor objects
        token_ids = encoded_pair['input_ids'].squeeze(0)  # tensor of token ids
        attn_masks = encoded_pair['attention_mask'].squeeze(0)  # binary tensor with "0" for padded values and "1" for the other values
        token_type_ids = encoded_pair['token_type_ids'].squeeze(0)  # binary tensor with "0" for the 1st sentence tokens & "1" for the 2nd sentence tokens
        if self.with_labels:  # True if the dataset has labels
            label = self.data.loc[index, 'score']
            return token_ids, attn_masks, token_type_ids, label
        else:
            return token_ids, attn_masks, token_type_ids
If you’re trying to load up all of your data into RAM, that might be where your issue is coming from. Have you tried checking your RAM while the model is running to see if it’s reaching max capacity?
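One way to check this from inside the notebook is to log the process's peak memory between steps. Below is a minimal sketch using only the standard library, assuming a Unix-like system (the `resource` module is not available on Windows). `peak_rss_mb` is a hypothetical helper name, and the list comprehension merely stands in for loading your data:

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mb()
payload = [str(i) * 10 for i in range(200_000)]  # stand-in for loading data
after = peak_rss_mb()
print(f"peak RSS: {before:.1f} MB -> {after:.1f} MB")
```

Note that this reports the peak for the calling process only. DataLoader worker processes are separate, so a system monitor such as `top` or `free -h` gives the fuller picture.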
A few ideas you can try that don’t involve purchasing more hardware:
- Tokenize all of the text before training, then delete the original pandas DataFrame. This will use less memory.
- Load the data as you go.
- Use a data container other than pandas. Pandas is not the most memory-efficient option.
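The "load the data as you go" idea could look roughly like the sketch below. It assumes the pairs have been exported to a tab-separated file with a header row and the columns `sentence`, `substring` and `score`, with no tabs or newlines inside the fields. The class name `LazyPairFile` and the file name `pairs.tsv` are hypothetical. Only one byte offset per row is kept in memory, and a single row is read from disk per lookup:

```python
import os
import tempfile
from array import array


class LazyPairFile:
    """Map-style dataset that reads one TSV row from disk per lookup."""

    def __init__(self, path):
        self.path = path
        # Compact array of 64-bit byte offsets, one per data row.
        self.offsets = array("q")
        with open(path, "rb") as f:
            f.readline()  # skip the header row
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if line.strip():  # ignore blank lines
                    self.offsets.append(pos)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        # Open per lookup so each DataLoader worker uses its own file handle.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[index])
            line = f.readline().decode("utf-8").rstrip("\r\n")
        sentence, substring, score = line.split("\t")
        return sentence, substring, float(score)


# Tiny demonstration file; the real data would already be on disk.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "pairs.tsv")
with open(demo_path, "w", encoding="utf-8", newline="") as f:
    f.write("sentence\tsubstring\tscore\n")
    f.write("first sentence\tfirst\t1\n")
    f.write("second sentence\tmissing\t0\n")

dataset = LazyPairFile(demo_path)
print(len(dataset), dataset[1])
```

Because it defines `__len__` and `__getitem__`, a class like this could subclass `torch.utils.data.Dataset` and call the tokenizer inside `__getitem__`, as `CustomDataset` does.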