Confusion about using .clone

Nils_Smitham86 · March 16, 2024, 7:53am

No, a more realistic application of clone is in a seq2seq model, which involves more than one decoding step.

Look at this code from the Hugging Face BART seq2seq model:

huggingface/transformers/blob/main/src/transformers/models/bart/modeling_bart.py#L1559


      
          def get_decoder(self):
              return self.decoder
          
          @add_start_docstrings_to_model_forward(BART_INPUTS_DOCSTRING)
          @add_code_sample_docstrings(
              checkpoint=_CHECKPOINT_FOR_DOC,
              output_type=Seq2SeqModelOutput,
              config_class=_CONFIG_FOR_DOC,
              expected_output=_EXPECTED_OUTPUT_SHAPE,
          )
          def forward(
              self,
              input_ids: torch.LongTensor = None,
              attention_mask: Optional[torch.Tensor] = None,
              decoder_input_ids: Optional[torch.LongTensor] = None,
              decoder_attention_mask: Optional[torch.LongTensor] = None,
              head_mask: Optional[torch.Tensor] = None,
              decoder_head_mask: Optional[torch.Tensor] = None,
              cross_attn_head_mask: Optional[torch.Tensor] = None,
              encoder_outputs: Optional[List[torch.FloatTensor]] = None,
              past_key_values: Optional[List[torch.FloatTensor]] = None,


huggingface/transformers/blob/main/src/transformers/models/bart/modeling_bart.py#L100


      
              indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
              max_seqlen_in_batch = seqlens_in_batch.max().item()
              cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
              return (
                  indices,
                  cu_seqlens,
                  max_seqlen_in_batch,
              )
          
          
          def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start_token_id: int):
              """
              Shift input ids one token to the right.
              """
              shifted_input_ids = input_ids.new_zeros(input_ids.shape)
              shifted_input_ids[:, 1:] = input_ids[:, :-1].clone()
              shifted_input_ids[:, 0] = decoder_start_token_id
          
              if pad_token_id is None:
                  raise ValueError("self.model.config.pad_token_id has to be defined.")
              # replace possible -100 values in labels by `pad_token_id`

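There are two branches from input_ids: the first is input_ids itself, and the second is decoder_input_ids, which is produced by a shift operation that writes in place into a new tensor filled from slices of input_ids. On the other hand, the forward function needs to keep the gradient for every element of input_ids, including the elements that end up in decoder_input_ids.

So you should use clone in this situation: it gives the second branch its own memory, so in-place writes on one branch cannot affect the other.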
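A minimal sketch of the idea, not the transformers implementation. The helper `shift_right` is a hypothetical, simplified stand-in for `shift_tokens_right` (it skips the pad-token handling), and the float tensor `emb` is an assumption added only to show the autograd side, since integer token ids do not carry gradients themselves:

```python
import torch

# 1) A slice is a view: it shares memory with the tensor it came from.
input_ids = torch.tensor([[5, 6, 7, 8]])
view = input_ids[:, :-1]             # no copy, same storage as input_ids
copy = input_ids[:, :-1].clone()     # independent storage

input_ids[0, 0] = 99                 # in-place write on the source
view_first = view[0, 0].item()       # the view sees the write
copy_first = copy[0, 0].item()       # the clone keeps the old value

# 2) Hypothetical, simplified shift: the second branch gets its own memory,
#    so the first branch (labels) is left untouched.
def shift_right(ids: torch.Tensor, start_id: int) -> torch.Tensor:
    shifted = ids.new_zeros(ids.shape)
    shifted[:, 1:] = ids[:, :-1].clone()
    shifted[:, 0] = start_id
    return shifted

labels = torch.tensor([[5, 6, 7, 8]])
decoder_input_ids = shift_right(labels, start_id=2)

# 3) clone is differentiable: an in-place edit on the cloned branch
#    still lets gradients flow back to the original tensor.
emb = torch.ones(3, requires_grad=True)
branch = emb.clone()
branch[0] = 0.0                      # in-place write on the clone only
(emb.sum() + branch.sum()).backward()
grad = emb.grad.tolist()             # element 0 only gets gradient from emb.sum()
```

Here `view_first` is 99 while `copy_first` stays 5, `labels` is unchanged after the shift, and `grad` is `[1.0, 2.0, 2.0]` because the overwritten element of `branch` no longer depends on `emb`.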