Training gets stuck after 1 epoch

I am trying to train a BERT-based model, but it seems to get stuck after 1 epoch and doesn’t proceed further. I am training BERT from scratch on my custom dataset, which contains around 13,900,000 training data points and around 4,500,000 points each for testing and validation. Training works fine for a smaller dataset, say around 1,000,000 data points, but for the size above it doesn’t move forward even after 2 hours. There is no error; the next epoch simply never starts, yet the cell keeps running for hours without progressing. I am running it on an Nvidia GeForce RTX 3090.

Hi, can you post a reduced code snippet that reproduces the error?

import torch
import torch.nn as nn
from torch.cuda.amp import autocast
from transformers import AutoModel


class SentencePairClassifier(nn.Module):
    def __init__(self, bert_model="allenai/scibert_scivocab_uncased", freeze_bert=False):
        super(SentencePairClassifier, self).__init__()
        # Instantiate the BERT-based model object
        self.bert_layer = AutoModel.from_pretrained(bert_model)

        # Hidden size of the chosen backbone, used to size the classification head
        if bert_model == "allenai/scibert_scivocab_uncased":  # ~110M parameters (BERT-base architecture)
            hidden_size = 768
        elif bert_model == "albert-large-v2":  # 18M parameters
            hidden_size = 1024
        elif bert_model == "albert-xlarge-v2":  # 60M parameters
            hidden_size = 2048
        elif bert_model == "albert-xxlarge-v2":  # 235M parameters
            hidden_size = 4096
        elif bert_model == "bert-base-uncased":  # 110M parameters
            hidden_size = 768

        # Freeze BERT layers and only train the classification layer weights
        if freeze_bert:
            for p in self.bert_layer.parameters():
                p.requires_grad = False

        # Classification layer
        self.cls_layer = nn.Linear(hidden_size, 1)

        self.dropout = nn.Dropout(p=0.1)

    @autocast()  # run the forward pass in mixed precision
    def forward(self, input_ids, attn_masks, token_type_ids):
        x = self.bert_layer(input_ids, attention_mask=attn_masks, token_type_ids=token_type_ids)
        pooler_output = x[1]  # pooled [CLS] representation
        pooler_output_tanh = torch.tanh(pooler_output)
        pooler_output_atanh = torch.atanh(pooler_output_tanh)
        logits = self.cls_layer(self.dropout(pooler_output_atanh))
        return logits
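
Since forward is decorated with @autocast, the loop that calls this model would normally pair it with a GradScaler. For context, a minimal sketch of that pattern; the train_loader, optimizer, criterion and device names are assumptions and not taken from my actual code:

from torch.cuda.amp import GradScaler

scaler = GradScaler()

def train_one_epoch(model, train_loader, optimizer, criterion, device):
    # criterion is assumed to be nn.BCEWithLogitsLoss() for the single-logit head
    model.train()
    for token_ids, attn_masks, token_type_ids, labels in train_loader:
        token_ids = token_ids.to(device)
        attn_masks = attn_masks.to(device)
        token_type_ids = token_type_ids.to(device)
        labels = labels.float().to(device)

        optimizer.zero_grad()
        # the forward pass already runs under autocast via the decorator
        logits = model(token_ids, attn_masks, token_type_ids)
        loss = criterion(logits.squeeze(-1), labels)

        # scale the loss for stable mixed-precision backprop
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()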

As I mentioned, this works for a smaller data size, but for a dataset of my size it doesn’t proceed after the 1st epoch.

From what you’re describing, it’s probably something to do with the dataloader or the training regime and not the model.

I am not sure. How is it training for the first epoch, then?

Memory issues usually occur in the training regime, and sometimes in the dataloader, not in the model. If the GPU has a memory issue, it will crash the script and raise a CUDA out-of-memory error. RAM, on the other hand, may just keep allocating virtual memory until your PC is rendered unusable.

Anyway, you’ve shown the model, and the issue is not reproducible from that code alone.

What else is needed to reproduce the issue?

Thank you for the information about memory. Is there any way I can avoid it?

The training regime and the dataloader (if it’s custom).

from torch.utils.data import Dataset
from transformers import AutoTokenizer


class CustomDataset(Dataset):
    """
    Tokenize each pair of sentence and substring to get token ids, attention masks and token type ids.

    Args:
        data: pandas DataFrame containing data about the papers.
        maxlen: maximum sequence length used for padding/truncation.
        with_labels: whether the data has a 'score' column.
        bert_model: name of the pretrained tokenizer to load.

    Returns:
        The tokenized version of each example on indexing.
    """

    def __init__(self, data, maxlen, with_labels=True, bert_model='allenai/scibert_scivocab_uncased'):
        self.data = data
        self.tokenizer = AutoTokenizer.from_pretrained(bert_model)
        self.maxlen = maxlen
        self.with_labels = with_labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sent1 = str(self.data.loc[index, 'sentence'])
        sent2 = str(self.data.loc[index, 'substring'])

        # Tokenize the pair of sentences to get token ids, attention masks and token type ids
        encoded_pair = self.tokenizer(sent1, sent2,
                                      padding='max_length',  # Pad to max_length
                                      truncation=True,       # Truncate to max_length
                                      max_length=self.maxlen,
                                      return_tensors='pt')   # Return torch.Tensor objects

        token_ids = encoded_pair['input_ids'].squeeze(0)            # tensor of token ids
        attn_masks = encoded_pair['attention_mask'].squeeze(0)      # "0" for padded positions, "1" for the rest
        token_type_ids = encoded_pair['token_type_ids'].squeeze(0)  # "0" for 1st sentence tokens, "1" for 2nd

        if self.with_labels:  # True if the dataset has labels
            label = self.data.loc[index, 'score']
            return token_ids, attn_masks, token_type_ids, label
        else:
            return token_ids, attn_masks, token_type_ids
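
For reference, a minimal sketch of how this dataset is wrapped in a DataLoader; the df variable, maxlen, batch size and worker count here are assumptions:

from torch.utils.data import DataLoader

# df is assumed to be the pandas DataFrame with 'sentence', 'substring' and 'score' columns
train_set = CustomDataset(df, maxlen=128, with_labels=True)
train_loader = DataLoader(train_set,
                          batch_size=32,
                          shuffle=True,
                          num_workers=4,    # each worker process gets its own handle on the dataset object
                          pin_memory=True)

Note that with num_workers > 0 every worker references the underlying DataFrame, so a large in-memory frame can multiply RAM usage.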

If you’re trying to load up all of your data into RAM, that might be where your issue is coming from. Have you tried checking your RAM while the model is running to see if it’s reaching max capacity?
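
One concrete way to check is to log RAM usage from inside the training loop with psutil; a minimal sketch (where and how often you call it is up to you):

import os
import psutil

def log_ram_usage(tag=""):
    # Resident memory of this process plus overall system RAM utilisation
    process = psutil.Process(os.getpid())
    rss_gb = process.memory_info().rss / 1024 ** 3
    print(f"{tag} process RSS: {rss_gb:.1f} GB | system RAM used: {psutil.virtual_memory().percent}%")

# e.g. call log_ram_usage(f"epoch {epoch}") at the end of every epoch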

A few ideas you can try that don’t involve purchasing more hardware:

  1. Tokenize all of the text before training and delete the original pandas frame; this will use less memory (see the sketch after this list).
  2. Load the data as you go instead of holding everything in RAM.
  3. Use some other type of data holder than pandas; pandas is not the most memory-efficient.
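
For idea 1, a minimal sketch of pre-tokenizing everything once and keeping only plain tensors in memory; the column names and maxlen follow the dataset code above, and the chunk size is an assumption:

import torch

def pretokenize(df, tokenizer, maxlen, chunk_size=100_000):
    # Tokenize the whole frame in chunks so peak memory stays bounded
    chunks = []
    for start in range(0, len(df), chunk_size):
        part = df.iloc[start:start + chunk_size]
        enc = tokenizer(part['sentence'].astype(str).tolist(),
                        part['substring'].astype(str).tolist(),
                        padding='max_length',
                        truncation=True,
                        max_length=maxlen,
                        return_tensors='pt')
        chunks.append((enc['input_ids'], enc['attention_mask'], enc['token_type_ids']))
    token_ids = torch.cat([c[0] for c in chunks])
    attn_masks = torch.cat([c[1] for c in chunks])
    token_type_ids = torch.cat([c[2] for c in chunks])
    labels = torch.tensor(df['score'].values, dtype=torch.float)
    return token_ids, attn_masks, token_type_ids, labels

# token_ids, attn_masks, token_type_ids, labels = pretokenize(df, tokenizer, maxlen=128)
# del df  # drop the original pandas frame once the tensors exist
# train_set = torch.utils.data.TensorDataset(token_ids, attn_masks, token_type_ids, labels)

For ~14M pairs even the tokenized tensors run to tens of gigabytes at int64, so you may also want to cast the ids and masks to a smaller integer dtype (converting back to long per batch) or save shards to disk and load them as you go, as in idea 2.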