Hi! Similar to the thread "Calling a layer multiple times will produce the same weights?":
I am building a seq2seq transformer model and want to add a classifier head on top of the encoder outputs. I know I could deconstruct the model and rebuild it from separate encoder and decoder layers, but would the following work? It is based on Language Translation with nn.Transformer and torchtext — PyTorch Tutorials 2.2.1+cu121 documentation (the lines I added are marked with "# added" comments):
import torch.nn as nn
from torch import Tensor
from torch.nn import Transformer

# TokenEmbedding and PositionalEncoding are as defined in the linked tutorial.
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(
            emb_size, dropout=dropout)
        # added: binary classification head over the pooled encoder output
        self.classifier = nn.Linear(emb_size, 2)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        # added: separate encoder pass, mean-pooled over the sequence dimension
        outs_encoder = self.transformer.encoder(src_emb, src_mask)
        outs_encoder = outs_encoder.mean(dim=0)
        outs_decoder = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                        src_padding_mask, tgt_padding_mask,
                                        memory_key_padding_mask)
        return self.generator(outs_decoder), self.classifier(outs_encoder)
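My understanding is that `self.transformer.encoder` is the very same module that `self.transformer` uses internally, so both code paths should touch the same weights. Something like the following quick check should confirm the parameters are shared rather than copied — this is my own untested sketch, assuming `TokenEmbedding` and `PositionalEncoding` from the tutorial are in scope, and the hyperparameter values are made up:

```python
# Untested sketch: verify that the encoder reached directly and the encoder
# inside the full nn.Transformer are the same parameter objects (shared weights).
model = Seq2SeqTransformer(num_encoder_layers=3, num_decoder_layers=3,
                           emb_size=512, nhead=8,
                           src_vocab_size=10000, tgt_vocab_size=10000)

direct = dict(model.transformer.encoder.named_parameters())
via_full = {name[len("encoder."):]: p
            for name, p in model.transformer.named_parameters()
            if name.startswith("encoder.")}
assert all(direct[k] is via_full[k] for k in direct)  # same tensors, not copies
```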
Does using `self.transformer` for the entire encoder-decoder pathway AND using `self.transformer.encoder` to get only the encoder outputs interfere with weight sharing when calling `forward`? Are the gradient calculations still valid this way?
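For reference, this is the alternative I mentioned above: running the encoder once and reusing its output ("memory") for both the decoder and the classifier head, instead of running the encoder a second time inside `self.transformer`. The keyword arguments follow the forward signatures of `nn.TransformerEncoder` and `nn.TransformerDecoder`; it is an untested sketch, not the tutorial code:

```python
# Untested sketch: single encoder pass, its output shared by the decoder head
# and the classifier head (would replace the forward() above).
def forward(self, src, trg, src_mask, tgt_mask,
            src_padding_mask, tgt_padding_mask, memory_key_padding_mask):
    src_emb = self.positional_encoding(self.src_tok_emb(src))
    tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
    # one encoder pass; memory is reused by both heads
    memory = self.transformer.encoder(src_emb, mask=src_mask,
                                      src_key_padding_mask=src_padding_mask)
    outs_decoder = self.transformer.decoder(tgt_emb, memory,
                                            tgt_mask=tgt_mask,
                                            tgt_key_padding_mask=tgt_padding_mask,
                                            memory_key_padding_mask=memory_key_padding_mask)
    # mean-pool the encoder output over the sequence dimension (dim 0, seq-first layout)
    return self.generator(outs_decoder), self.classifier(memory.mean(dim=0))
```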
Thank you!