Value of [CLS] Token for Transformer Encoders

tom · October 26, 2022, 8:46am

The typical approach (GPT, vision transformer, …) is to make the [CLS] token a learned parameter, e.g. here:

MathInf/toroidal/blob/bff09f725627e4629d464008dc7c5f9d6322ebad/toroidal/models.py#L18

          self.n_channels = n_channels
          num_patches = (input_size ** 2) // (
              patch_size ** 2
          )  # only when this is divisible...
          self.patch_to_vec = torch.nn.Conv2d(
              3,
              n_channels,
              kernel_size=(patch_size, patch_size),
              stride=(patch_size, patch_size),
          )  # equivalent to reshape + linear
          self.class_token = torch.nn.Parameter(
              torch.randn(1, 1, n_channels)
          )  # "global information"
          self.pos_embedding = torch.nn.Parameter(
              torch.randn(1, num_patches + 1, n_channels)
          )
          
          
          torch.nn.init.zeros_(self.patch_to_vec.bias)
          torch.nn.init.normal_(self.patch_to_vec.weight, std=0.02)  # scale with size?
          torch.nn.init.normal_(self.class_token, std=0.02)
          torch.nn.init.normal_(self.pos_embedding, std=0.02)
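
To show where the learned token actually gets used, here is a rough sketch of a forward pass. It is not the code from the linked repository: the class name `TinyViTEncoder`, the layer sizes, and the use of `torch.nn.TransformerEncoder` with a linear head are all assumptions for illustration. The idea is that the same parameter is broadcast over the batch, prepended to the patch sequence, and its output position is read off as the "global" summary.

```python
import torch


class TinyViTEncoder(torch.nn.Module):
    """Hypothetical minimal model; everything beyond the excerpt above is assumed."""

    def __init__(self, input_size=32, patch_size=8, n_channels=64, n_classes=10):
        super().__init__()
        num_patches = (input_size // patch_size) ** 2
        self.patch_to_vec = torch.nn.Conv2d(
            3, n_channels, kernel_size=patch_size, stride=patch_size
        )
        # One learned vector, shared by every sample in the batch.
        self.class_token = torch.nn.Parameter(torch.randn(1, 1, n_channels) * 0.02)
        self.pos_embedding = torch.nn.Parameter(
            torch.randn(1, num_patches + 1, n_channels) * 0.02
        )
        layer = torch.nn.TransformerEncoderLayer(
            d_model=n_channels, nhead=4, dim_feedforward=128, batch_first=True
        )
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
        self.head = torch.nn.Linear(n_channels, n_classes)

    def forward(self, img):
        x = self.patch_to_vec(img)        # (B, C, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, C)
        # Broadcast the single token over the batch and prepend it.
        cls = self.class_token.expand(x.shape[0], -1, -1)  # (B, 1, C)
        x = torch.cat([cls, x], dim=1)    # (B, num_patches + 1, C)
        x = x + self.pos_embedding
        x = self.encoder(x)
        # Position 0 has attended to all patches; use it for classification.
        return self.head(x[:, 0])


model = TinyViTEncoder()
logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```

Because `class_token` is a `torch.nn.Parameter`, it receives gradients like any other weight, so the initial value is learned rather than fixed.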

Best regards

Thomas