I’m trying to get the built-in pytorch TransformerEncoder to do a classification task; my eventual goal is to replicate the ToBERT model from this paper (its paperswithcode entry is empty). Unfortunately, my model doesn’t seem to learn anything.
```python
import math

import torch.nn as nn


class Net(nn.Module):
    def __init__(
        self,
        embeddings,
        nhead=8,
        nhid=200,
        num_layers=2,
        dropout=0.1,
        classifier_dropout=0.1,
        max_len=256,
    ):
        super().__init__()

        self.d_model = embeddings.size(1)
        assert (
            self.d_model % nhead == 0
        ), "nheads must divide evenly into d_model"

        # Start from pretrained embeddings, but keep them trainable
        self.emb = nn.Embedding.from_pretrained(embeddings, freeze=False)

        self.pos_encoder = PositionalEncoding(
            self.d_model, dropout=dropout, vocab_size=embeddings.size(0)
        )

        encoder_layers = nn.TransformerEncoderLayer(
            d_model=self.d_model, nhead=nhead, dim_feedforward=nhid, dropout=dropout
        )
        self.transformer_encoder = nn.TransformerEncoder(
            encoder_layers, num_layers=num_layers
        )

        self.classifier = nn.Sequential(
            # Other layers to go here if needed once things seem to be working
            nn.Linear(self.d_model, 2),
        )

    def forward(self, x):
        x = self.emb(x) * math.sqrt(self.d_model)
        x = self.pos_encoder(x)
        x = self.transformer_encoder(x)  # self.src_mask)
        x = x.mean(dim=1)  # mean-pool over the sequence dimension
        return self.classifier(x)
```
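For context, the classification head is just mean pooling over the sequence followed by a linear layer; in isolation (a self-contained sketch with made-up sizes, standing in for a real encoder output) it behaves like this:

```python
import torch
import torch.nn as nn

d_model = 64  # made-up model dimension for illustration
classifier = nn.Sequential(nn.Linear(d_model, 2))

x = torch.randn(32, 100, d_model)  # fake encoder output: (batch, seq_len, d_model)
pooled = x.mean(dim=1)             # average over the sequence dimension -> (32, 64)
logits = classifier(pooled)        # -> (32, 2)
print(logits.shape)                # torch.Size([32, 2])
```

So at least the shapes coming out of the head look right to me.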
```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, vocab_size=5000, dropout=0.1, batch_size=100):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(vocab_size, d_model)
        position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # (1, vocab_size, d_model), broadcasts over the batch
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)
```
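As a sanity check (a standalone sketch, not part of the model), I rebuilt just the encoding table inline and verified it has the shape I expect and broadcasts over a batch-first input the way the `forward()` slicing assumes:

```python
import math
import torch

vocab_size, d_model = 5000, 64  # same defaults as above, d_model made up
pe = torch.zeros(vocab_size, d_model)
position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(
    torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0)  # (1, vocab_size, d_model)

x = torch.zeros(32, 100, d_model)  # (batch, seq_len, d_model)
out = x + pe[:, :x.size(1), :]     # same slicing as in forward()
print(out.shape)                   # torch.Size([32, 100, 64])
```

Position 0 comes out as sin(0)=0 in the even columns and cos(0)=1 in the odd ones, as expected, so I don’t think the table itself is wrong.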
The PositionalEncoding layer is taken almost directly from the pytorch language modeling example, with the exception of changing dimensions to match my preference for batch-first tensors.
I found a couple of examples of transformers for classification:
Both of these seem to work with good accuracy, so I’m sure it’s possible, but both also seem to build the transformer “from scratch.” I’d like to figure out why I can’t get it to work with the pytorch TransformerEncoder.
When I run it, the loss doesn’t go anywhere even on my training set, so it’s clearly just not learning. I’ve tried going through the PositionalEncoding layer a few times, since that’s where much of the complexity lies, and have even tried replacing it with the positional encoding strategies used in the libraries above, with no difference.
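In case the problem is in how I train rather than in the model, here is the shape of my training loop (simplified, with a stand-in model and fake data, since the real data loading isn’t relevant):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for Net: flatten fake "embedded" inputs and classify into 2 classes
model = nn.Sequential(nn.Flatten(), nn.Linear(10 * 8, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(64, 10, 8)      # fake batch: (batch, seq_len, d_model)
y = torch.randint(0, 2, (64,))  # fake binary labels

losses = []
for epoch in range(5):
    optimizer.zero_grad()
    logits = model(x)           # (batch, 2)
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
    print(epoch, losses[-1])
```

With the real Net in place of the stand-in, the printed loss just hovers around its starting value.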
Does anyone see something I’m doing obviously wrong? Am I mistaken that I should be able to use a TransformerEncoder for classification in this way?
Many thanks in advance!