I have a simple RNN-based model for Named Entity Recognition (NER) which works pretty well on a common dataset. The loss quickly drops below 4 (the value is only relevant for a comparison later), and from inspecting the predicted NE tags on test samples, the results look very good. I’m not looking for SOTA results here :).
Now I would like to do the same with a Transformer-based model – and I’m fairly new to transformers. Intuitively, I would simply use an nn.TransformerEncoder (incl. positional encoding) and push its output through some additional nn.Linear layers. I’ve added the code for my model at the very end.
In principle, the model seems to train, but it takes much longer than the RNN-based model and the loss only goes down to around 80. I’ve tried different numbers of layers, heads, encoder dimension sizes, and learning rates. When trying some test samples, the predictions don’t look awfully bad, but they’re certainly not great. So right now I’m a bit stuck, mainly with the following questions:
- Is my model suitable to begin with, or do I have some fundamental misunderstanding about the purpose of the Transformer encoder?
- When looking for NER examples, all of them seem to use BERT or alternatives. Is there a principled reason why these are used instead of a vanilla Transformer model? (Again, I’m not looking for SOTA results, but to understand the Transformer architecture.)
Here’s the code of my model. Note that the RNN-based model is basically the same, just with an LSTM layer instead of the Transformer encoder, and it works much better after much less training time.
import torch
import torch.nn as nn


class TransformerNER(nn.Module):

    def __init__(self, params):
        super().__init__()
        self.params = params

        # Embeddings for tokens and POS tags (each gets half of the model dimension)
        self.embed_words = nn.Embedding(self.params.vocab_size_words, params.tf_model_size // 2)
        self.embed_pos = nn.Embedding(self.params.vocab_size_pos, params.tf_model_size // 2)

        # Positional encoding layer
        self.pos_encoder = PositionalEncoding(
            d_model=params.tf_model_size,
            dropout=params.tf_dropout,
            vocab_size=params.vocab_size_words,
        )

        # Transformer block
        encoder_layers = nn.TransformerEncoderLayer(
            params.tf_model_size,
            params.tf_num_heads,
            params.tf_hidden_size,
            params.tf_dropout,
        )
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, params.tf_num_layers)

        # Fully connected layers (incl. Dropout and Activation)
        linear_sizes = [params.tf_model_size] + params.linear_hidden_sizes
        self.linears = nn.ModuleList()
        for i in range(len(linear_sizes) - 1):
            # Add Dropout layer if probability > 0
            if params.linear_dropout > 0.0:
                self.linears.append(nn.Dropout(p=params.linear_dropout))
            self.linears.append(nn.Linear(linear_sizes[i], linear_sizes[i + 1]))
            self.linears.append(nn.ReLU())

        # Final projection onto the tag vocabulary
        self.out = nn.Linear(linear_sizes[-1], params.vocab_size_tag)

    def forward(self, X):
        batch_size, seq_len = X.shape

        # X packs word indices and POS-tag indices side by side along dim 1; split the two halves
        X_words, X_pos = torch.split(X, seq_len // 2, dim=1)
        X_words = self.embed_words(X_words)
        X_pos = self.embed_pos(X_pos)

        # Combine word and POS tag features
        X = torch.cat([X_words, X_pos], dim=2)

        # Push through positional encoding layer
        X = self.pos_encoder(X)

        # Push through Transformer encoder and the fully connected layers
        outputs = self.transformer_encoder(X)
        for l in self.linears:
            outputs = l(outputs)

        # Return the tag logits for each token
        return self.out(outputs)
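The PositionalEncoding module referenced above isn’t included here. For reference, a minimal sketch of a standard sinusoidal positional encoding with the same signature (assuming batch-first inputs of shape (batch_size, seq_len, d_model), and with vocab_size simply used as the maximum sequence length) could look like this:

import math

class PositionalEncoding(nn.Module):
    """Sketch of a standard sinusoidal positional encoding (batch-first)."""

    def __init__(self, d_model, dropout=0.1, vocab_size=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Precompute the sinusoidal table; vocab_size acts as the maximum sequence length here
        position = torch.arange(vocab_size).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(vocab_size, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # shape: (1, max_len, d_model)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model)
        x = x + self.pe[:, : x.size(1), :]
        return self.dropout(x)

For completeness, a quick way to smoke-test the model with dummy inputs (the hyperparameter values below are arbitrary placeholders, not the ones I actually train with):

from types import SimpleNamespace

# Arbitrary placeholder hyperparameters, just to illustrate the expected shapes
params = SimpleNamespace(
    vocab_size_words=10000,
    vocab_size_pos=20,
    vocab_size_tag=10,
    tf_model_size=128,        # must be even: words and POS tags each get half
    tf_num_heads=4,
    tf_hidden_size=256,
    tf_num_layers=2,
    tf_dropout=0.1,
    linear_hidden_sizes=[64],
    linear_dropout=0.1,
)

model = TransformerNER(params)

# The input packs word indices and POS-tag indices side by side along dim 1,
# so a batch of sequences of length 30 has shape (batch_size, 60)
batch_size, seq_len = 8, 30
X_words = torch.randint(0, params.vocab_size_words, (batch_size, seq_len))
X_pos = torch.randint(0, params.vocab_size_pos, (batch_size, seq_len))
X = torch.cat([X_words, X_pos], dim=1)

logits = model(X)
print(logits.shape)  # torch.Size([8, 30, 10]), i.e. one tag distribution per token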