Vanilla Transformer for NER?

I have a simple RNN-based model for Named Entity Recognition (NER) which works pretty well on a common dataset. I quickly get the loss down to <4 (only relevant for a later comparison), and from inspecting the predicted NE tags on test samples, the results look very good. I'm not looking for SOTA results here :).

Now I would like to do the same with a Transformer-based model, and I'm fairly new to transformers. Intuitively, I would simply use an nn.TransformerEncoder (including a positional encoding) and push the output through some additional nn.Linear layers. I've added the code for my model at the very end.

In principle, the model seems to train, but it takes much longer than the RNN-based model, and the loss only goes down to around 80. I've tried different numbers of layers, heads, encoder dimension sizes, and learning rates. When trying some test samples, the output doesn't look awful, but it's certainly not great. So right now I'm a bit stuck, mainly with the following questions:

  • Is my model suitable to begin with, or do I have some fundamental misunderstanding about the purpose of the Transformer encoder?

  • When looking for examples for NER, all seem to use BERT or alternatives. Is there a principled reason why these are used instead of a vanilla Transformer model? (Again, I'm not looking for SOTA results, but to understand the Transformer architecture.)

Here's the code of my model. Note that the RNN-based model is basically the same, just with an LSTM layer instead of the Transformer encoder, and it works much better after training for much less time.

import torch
import torch.nn as nn

class TransformerNER(nn.Module):
    
    def __init__(self, params):
        super().__init__()
        self.params = params
        
        # Embeddings for tokens and POS tags
        self.embed_words = nn.Embedding(self.params.vocab_size_words, params.tf_model_size // 2)
        self.embed_pos = nn.Embedding(self.params.vocab_size_pos, params.tf_model_size // 2)
        
        # Positional encoding layer
        self.pos_encoder = PositionalEncoding(
            d_model=params.tf_model_size,
            dropout=params.tf_dropout,
            vocab_size=params.vocab_size_words,
        )        
        
        # Transformer block
        encoder_layers = nn.TransformerEncoderLayer(params.tf_model_size, params.tf_num_heads, params.tf_hidden_size, params.tf_dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, params.tf_num_layers)
        
        # Fully connected layers (incl. Dropout and Activation)
        linear_sizes = [params.tf_model_size] + params.linear_hidden_sizes
        
        self.linears = nn.ModuleList()
        for i in range(len(linear_sizes)-1):
            # Add a Dropout layer if the probability > 0
            if params.linear_dropout > 0.0:
                self.linears.append(nn.Dropout(p=params.linear_dropout))
            self.linears.append(nn.Linear(linear_sizes[i], linear_sizes[i+1]))
            self.linears.append(nn.ReLU())
        
        self.out = nn.Linear(linear_sizes[-1], params.vocab_size_tag)        
        
        
        
    def forward(self, X):
        batch_size, seq_len = X.shape
        
        # X packs word ids and POS-tag ids side by side along dim 1, so split it in half
        X_words, X_pos = torch.split(X, seq_len//2, dim=1)
        
        X_words = self.embed_words(X_words)
        X_pos = self.embed_pos(X_pos)
        
        # Combine word and POS tag features
        X = torch.cat([X_words, X_pos], dim=2)
        
        # Push through positional encoding layer
        X = self.pos_encoder(X)        
        
        outputs = self.transformer_encoder(X)
        
        for l in self.linears:
            outputs = l(outputs)
        
        # Return outputs
        return self.out(outputs)        
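
(The PositionalEncoding class isn't shown above; for completeness, here is a sketch of the standard sinusoidal encoding from the original 2017 Transformer paper, assuming the vocab_size argument is simply used as the maximum sequence length and that inputs are batch-first, matching how forward() builds X.)

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, vocab_size=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(vocab_size).unsqueeze(1)                        # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(1, vocab_size, d_model)                                # batch-first buffer
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model), as produced in TransformerNER.forward
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)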

At a high level, you want to use BERT because these models are very, very hard to train from scratch. If you use a pretrained BERT, the parameters have already converged to a very good place and do well on two tasks (NSP and MLM); if you put a sentence through the model, you'll get good contextual embeddings for each word. Doing this from scratch is very expensive (you need lots of data, lots of GPUs, efficient code, etc.).

Typically what you then do is put a small lego block on top of BERT and "fine-tune" the parameters of this block to do what you want. For example, maybe you add an FF network on top of each predicted embedding and then a softmax layer linking the embedding to the NE tag you want (maybe there are 10 NE classes, etc.). You can then optimize the new model and (1) adjust ALL parameters (BERT + the new lego block - this is expensive) or (2) adjust just the FF parts needed for the NER task (cheaper). Let me know if this makes sense or if you had another question …
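
To make that concrete, here is a rough sketch of that kind of setup using the Hugging Face transformers library (the model name, head sizes, and num_tags below are just illustrative, not anything specific to your data):

import torch.nn as nn
from transformers import BertModel

class BertForNER(nn.Module):
    def __init__(self, num_tags, freeze_bert=True):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-cased")
        if freeze_bert:
            # Option (2) above: train only the new head, keep BERT's parameters fixed
            for p in self.bert.parameters():
                p.requires_grad = False
        # The small "lego block": a feed-forward head applied to each token embedding
        self.head = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_tags),
        )

    def forward(self, input_ids, attention_mask):
        # Contextual embeddings from BERT: (batch_size, seq_len, hidden_size)
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Per-token logits over the NE tag set; the softmax lives inside the loss (e.g. CrossEntropyLoss)
        return self.head(hidden)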


Here is a good link I just found: Named Entity Recognition with BERT in PyTorch | by Ruben Winastwan | Towards Data Science … The original paper also talks about this idea, but this has an example …


@dreidizzle Thanks for your reply! I get the basic idea behind BERT as a pretrained model for transfer learning. No doubt that this is what you would do in practice for a real-world application/model.

I'm just trying to gauge my basic understanding of transformers; I'm not aiming for a SOTA model :). Here, for example, I simply try to overfit the model on a very small dataset, e.g., only 100-1000 NER-labeled sentences. In the beginning the loss goes down reasonably well, but it never goes towards 0 the way the RNN-based model's loss does.

I get that Transformers are difficult to train, and maybe I didn't train long enough or with the right optimizer / learning rate / etc. So maybe I should rephrase my question: Why can't I overfit a vanilla Transformer for NER on a small toy dataset?
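
For reference, the overfit check I'm describing looks roughly like this (all sizes are made up; it assumes the TransformerNER class from my first post, some PositionalEncoding implementation, and a per-token cross-entropy loss):

from types import SimpleNamespace
import torch
import torch.nn as nn

# Made-up hyperparameters, just to make the check concrete
params = SimpleNamespace(
    vocab_size_words=10000, vocab_size_pos=50, vocab_size_tag=10,
    tf_model_size=128, tf_num_heads=4, tf_hidden_size=256,
    tf_dropout=0.1, tf_num_layers=2,
    linear_hidden_sizes=[64], linear_dropout=0.1,
)
model = TransformerNER(params)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One tiny batch: 20 word ids followed by 20 POS-tag ids per sentence, one NE tag per token
X = torch.randint(0, 50, (32, 40))
y = torch.randint(0, params.vocab_size_tag, (32, 20))

for epoch in range(500):
    optimizer.zero_grad()
    logits = model(X)                                           # (32, 20, vocab_size_tag)
    loss = criterion(logits.reshape(-1, params.vocab_size_tag), y.reshape(-1))
    loss.backward()
    optimizer.step()
# If the model can memorize this single batch, the loss should drop close to 0.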

Hm, I guess let me ask some questions / give some comments:

  1) Yes, these models are very hard to train. Honestly, the big moves in the field seem to come from teams that not only have very smart people but also have crazy infra. What is your infra?

  2) How big is your dataset?

  3) How many layers and heads are you using? How many dimensions are your vectors? If they are close to BERT or the 2017 paper scale, this might not be feasible unless you match their infra.

  4) What is your it/s, and are you printing the loss per batch or per epoch (I guess so, and it is not moving)?

  5) Unsure about the optimization, but you might need some sort of decay on the learning rate (i.e. no progress → up it by 2, gradient blows up → cut it in half); there is a sketch after this post.

If you want to understand the transformer well, Jay Alammar has a very good tutorial https://jalammar.github.io/ and Sasha Rush has the annotated transformer, which is the vanilla transformer from 2017 all in PyTorch: The Annotated Transformer (there are experiments with the loss going down, too!) … I personally like [2010.06467] Pretrained Transformers for Text Ranking: BERT and Beyond the most … All this being said, you SHOULD be able to get this to run (though it will not perform as well) if you really shrink down the model, but I just want to make sure you are doing that.
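
For the learning-rate point, The Annotated Transformer (linked above) uses the warmup-then-decay schedule from the 2017 paper; here is a rough sketch with illustrative numbers (warmup steps, d_model, and the stand-in model are all made up):

import torch
import torch.nn as nn

def noam_lr(step, d_model=128, warmup=400):
    # Rate from "Attention Is All You Need": linear warmup, then decay like 1/sqrt(step)
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup ** -1.5)

model = nn.Linear(128, 10)  # stand-in for the actual Transformer model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(1000):
    # ... forward pass, loss.backward() ...
    optimizer.step()
    scheduler.step()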