Hi! Similar to the thread "Calling a layer multiple times will produce the same weights?":
I am building a seq2seq transformer model and want to add a classifier head on top of the encoder outputs. I know I could deconstruct the model and rebuild it from separate encoder and decoder layers, but would the following work? It is based on Language Translation with nn.Transformer and torchtext — PyTorch Tutorials 2.2.1+cu121 documentation (the lines I added are marked with "# added" comments):
import torch.nn as nn
from torch import Tensor
from torch.nn import Transformer

# TokenEmbedding and PositionalEncoding are as defined in the linked tutorial.
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(
            emb_size, dropout=dropout)
        # added: binary classification head over the pooled encoder output
        self.classifier = nn.Linear(emb_size, 2)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        # added: separate encoder pass, mean-pooled over the sequence dimension
        outs_encoder = self.transformer.encoder(src_emb, src_mask)
        outs_encoder = outs_encoder.mean(dim=0)
        outs_decoder = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                        src_padding_mask, tgt_padding_mask,
                                        memory_key_padding_mask)
        return self.generator(outs_decoder), self.classifier(outs_encoder)
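My understanding is that `self.transformer.encoder` is the very same module that `self.transformer` uses internally, so both code paths should touch the same weights. Something like the following quick check should confirm the parameters are shared rather than copied — this is my own untested sketch, assuming `TokenEmbedding` and `PositionalEncoding` from the tutorial are in scope, and the hyperparameter values are made up:

```python
# Untested sketch: verify that the encoder reached directly and the encoder
# inside the full nn.Transformer are the same parameter objects (shared weights).
model = Seq2SeqTransformer(num_encoder_layers=3, num_decoder_layers=3,
                           emb_size=512, nhead=8,
                           src_vocab_size=10000, tgt_vocab_size=10000)

direct = dict(model.transformer.encoder.named_parameters())
via_full = {name[len("encoder."):]: p
            for name, p in model.transformer.named_parameters()
            if name.startswith("encoder.")}
assert all(direct[k] is via_full[k] for k in direct)  # same tensors, not copies
```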
Does using `self.transformer` for the entire encoder-decoder pathway AND using `self.transformer.encoder` to get only the encoder outputs interfere with weight sharing when calling `forward`? Are the gradient calculations still valid this way?
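For reference, this is the alternative I mentioned above: running the encoder once and reusing its output ("memory") for both the decoder and the classifier head, instead of running the encoder a second time inside `self.transformer`. The keyword arguments follow the forward signatures of `nn.TransformerEncoder` and `nn.TransformerDecoder`; it is an untested sketch, not the tutorial code:

```python
# Untested sketch: single encoder pass, its output shared by the decoder head
# and the classifier head (would replace the forward() above).
def forward(self, src, trg, src_mask, tgt_mask,
            src_padding_mask, tgt_padding_mask, memory_key_padding_mask):
    src_emb = self.positional_encoding(self.src_tok_emb(src))
    tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
    # one encoder pass; memory is reused by both heads
    memory = self.transformer.encoder(src_emb, mask=src_mask,
                                      src_key_padding_mask=src_padding_mask)
    outs_decoder = self.transformer.decoder(tgt_emb, memory,
                                            tgt_mask=tgt_mask,
                                            tgt_key_padding_mask=tgt_padding_mask,
                                            memory_key_padding_mask=memory_key_padding_mask)
    # mean-pool the encoder output over the sequence dimension (dim 0, seq-first layout)
    return self.generator(outs_decoder), self.classifier(memory.mean(dim=0))
```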
Thank you!