How to implement BERT using torch.nn.Transformer?

Currently, I use nn.TransformerEncoder to implement BERT.
An example of a BERT architecture:

import torch.nn as nn

# example hyperparameters (BERT-base style values)
embedding_size, num_heads, num_encoder_layers, output_vocab_size = 768, 12, 12, 30522
encoder_layer = nn.TransformerEncoderLayer(d_model=embedding_size, nhead=num_heads)
bert = nn.Sequential(
    nn.TransformerEncoder(encoder_layer, num_layers=num_encoder_layers),  # BERT encoder stack
    nn.Linear(embedding_size, output_vocab_size),  # token-level prediction head
)
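
For reference, a minimal usage sketch (the input is assumed to already be token plus position embeddings, in the default (seq_len, batch, d_model) layout; the shapes below are just placeholders):

import torch
x = torch.rand(128, 8, embedding_size)  # hypothetical embedded batch: (seq_len, batch, embedding_size)
logits = bert(x)                        # (128, 8, output_vocab_size), e.g. for masked-token prediction
print(logits.shape)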

How do I achieve the same using the nn.Transformer API?

The doc says:

Users can build the BERT model with corresponding parameters.

Even if I set num_decoder_layers=0 when initializing it, the forward() call still requires the tgt argument for the transformer's decoder, but BERT has no decoder.
So how do we go about it?
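
For illustration, a minimal sketch of what I mean (the hyperparameter values are just placeholders):

import torch
import torch.nn as nn

model = nn.Transformer(d_model=768, nhead=12, num_encoder_layers=12, num_decoder_layers=0)
src = torch.rand(128, 8, 768)  # (seq_len, batch, d_model)
out = model(src)  # TypeError: forward() missing 1 required positional argument: 'tgt'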


Note: I am aware that HuggingFace provides BERT out of the box, but for simple non-NLP experiments with small custom BERT-like architectures, I think plain PyTorch should suffice. Please let me know if I'm wrong.

I'm struggling with the same thing, actually, and am still using a from-scratch implementation. Any help would be appreciated.

Why do you want to do that? Isn’t your initial code using TransformerEncoder simple enough?
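
That said, if you really want to go through nn.Transformer: it builds its encoder stack as a regular nn.TransformerEncoder and exposes it as the .encoder attribute, so one possible sketch (hyperparameters and the extra Linear head are placeholders of my own) is to call that submodule directly and never touch the decoder:

import torch
import torch.nn as nn

# placeholder hyperparameters
d_model, nhead, num_layers, vocab_size = 768, 12, 12, 30522

model = nn.Transformer(d_model=d_model, nhead=nhead,
                       num_encoder_layers=num_layers, num_decoder_layers=0)
head = nn.Linear(d_model, vocab_size)   # hypothetical token-level prediction head

src = torch.rand(128, 8, d_model)       # (seq_len, batch, d_model) embedded input
memory = model.encoder(src)             # use only the encoder stack; the (empty) decoder is never called
logits = head(memory)                   # (128, 8, vocab_size)

Whether that is really simpler than your original nn.TransformerEncoder version is debatable, which is why I'd ask what you expect to gain from it.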