Hello all,

I’m trying to get the built-in PyTorch `TransformerEncoder` to do a classification task; my eventual goal is to replicate the `ToBERT` model from this paper (paperswithcode is empty). Unfortunately, my model doesn’t seem to learn anything.

```
import math

import torch.nn as nn


class Net(nn.Module):
    def __init__(
        self,
        embeddings,
        nhead=8,
        nhid=200,
        num_layers=2,
        dropout=0.1,
        classifier_dropout=0.1,
        max_len=256,
    ):
        super().__init__()
        # embeddings: pretrained matrix of shape (vocab_size, d_model)
        self.d_model = embeddings.size(1)
        assert (
            self.d_model % nhead == 0
        ), "nheads must divide evenly into d_model"
        self.emb = nn.Embedding.from_pretrained(embeddings, freeze=False)
        self.pos_encoder = PositionalEncoding(
            self.d_model, dropout=dropout, vocab_size=embeddings.size(0)
        )
        encoder_layers = nn.TransformerEncoderLayer(
            d_model=self.d_model, nhead=nhead, dim_feedforward=nhid, dropout=dropout
        )
        self.transformer_encoder = nn.TransformerEncoder(
            encoder_layers, num_layers=num_layers
        )
        self.classifier = nn.Sequential(
            # Other layers to go here if needed once things seem to be working
            nn.Linear(self.d_model, 2),
        )

    def forward(self, x):
        # x: (batch, seq_len) token indices
        x = self.emb(x) * math.sqrt(self.d_model)
        x = self.pos_encoder(x)
        x = self.transformer_encoder(x)  # self.src_mask)
        x = x.mean(dim=1)  # average over dim 1 before classifying
        return self.classifier(x)
```

```
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, vocab_size=5000, dropout=0.1, batch_size=100):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Precompute the sinusoidal table, one row per position
        pe = torch.zeros(vocab_size, d_model)
        position = torch.arange(0, vocab_size, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # (1, vocab_size, d_model), broadcasts over the batch
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)
```

The `PositionalEncoding` layer is taken almost directly from the PyTorch language modeling example, with the exception of changing dimensions to match my preference for `batch_first=True`.
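For reference, here is a minimal sketch of the layout convention involved, assuming a PyTorch version in which `nn.TransformerEncoderLayer` accepts a `batch_first` argument (it defaults to `False`, meaning inputs are read as `(seq_len, batch, d_model)`). The sizes and the helper names `make_encoder` and `sample0_depends_on_sample1` are made up for illustration and are not part of the model above. The check asks whether changing one sample in a batch-first tensor alters the encoder output for a different sample.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, d_model = 4, 10, 16  # arbitrary illustrative sizes


def make_encoder(batch_first):
    """Build a one-layer encoder in eval mode (hypothetical helper)."""
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=4, dim_feedforward=32,
        dropout=0.0, batch_first=batch_first,
    )
    return nn.TransformerEncoder(layer, num_layers=1).eval()


def sample0_depends_on_sample1(encoder):
    """Return True if changing x[1] changes the output at index 0 of dim 0."""
    x = torch.randn(batch, seq_len, d_model)  # intended as (batch, seq, d_model)
    x_changed = x.clone()
    x_changed[1] = torch.randn(seq_len, d_model)  # replace only sample 1
    with torch.no_grad():
        out_a = encoder(x)[0]
        out_b = encoder(x_changed)[0]
    return not torch.allclose(out_a, out_b, atol=1e-5)


leaks_batch_first = sample0_depends_on_sample1(make_encoder(batch_first=True))
leaks_default = sample0_depends_on_sample1(make_encoder(batch_first=False))
print("batch_first=True mixes samples:", leaks_batch_first)
print("default layout mixes samples:  ", leaks_default)
```

Both encoders accept the same tensor without raising a shape error, so a layout mismatch of this kind would not show up as an exception.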

There are a few similar posts, all without definite answers.

I found a couple of examples of transformers for classification:

Both of these seem to work with good accuracy, so I’m sure it’s possible, but both also seem to build the transformer “from scratch.” I’d like to figure out why I can’t get it to work with the PyTorch `TransformerEncoder`.

When I run it, even the loss on my *training* set doesn’t go anywhere, so the model is clearly just not learning. I’ve gone through the `PositionalEncoding` layer a few times, since that’s where much of the complexity lies, and even tried replacing it with the positional encoding strategies used in the libraries above, with no difference.
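One way to separate "the architecture cannot learn" from "something in the training setup is off" is to try to overfit a single fixed batch. The sketch below is only an illustration under assumed toy settings: `TinyClassifier`, the sizes, the random data, the Adam optimizer, and the learning rate of `1e-3` are all hypothetical stand-ins rather than my actual model or data, and positional encoding is left out to keep it short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only so the check runs quickly.
vocab_size, d_model, seq_len, batch = 50, 16, 12, 8


class TinyClassifier(nn.Module):
    """Stand-in model: embedding -> TransformerEncoder -> mean pool -> linear."""

    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=32,
            dropout=0.0, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.classifier = nn.Linear(d_model, 2)

    def forward(self, tokens):
        hidden = self.encoder(self.emb(tokens))     # (batch, seq_len, d_model)
        return self.classifier(hidden.mean(dim=1))  # (batch, 2)


model = TinyClassifier()
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # one fixed batch
labels = torch.randint(0, 2, (batch,))                   # random binary labels

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(tokens), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

If the loss on a single memorizable batch does not drop either, the problem is more likely in the model wiring or optimizer settings than in the data.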

Does anyone see something I’m doing obviously wrong? Am I mistaken that I should be able to use a `TransformerEncoder` for classification in this way?

Many thanks in advance!