Vision Transformer: reshaping the input

Hi everyone,

I am using a pre-trained vision transformer from the models available in PyTorch. The question now: if I use the pre-trained weights and apply the same transformations that were applied during training (on ImageNet, for example), does the model by its nature do the reshaping of the input data?

pre_trained_weights = ViT_L_16_Weights.IMAGENET1K_V1

transforms = pre_trained_weights.transforms()

I mean something like this.

Thanks :heart:

I'm not sure I understand the question properly, but if you are wondering whether the transformation contains a Resize, just print it.
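
For example (a quick check; the exact output depends on your torchvision version):

from torchvision.models import ViT_L_16_Weights

# The preprocessing preset bundled with the pretrained weights
weights = ViT_L_16_Weights.IMAGENET1K_V1
preprocess = weights.transforms()

# Printing the preset shows what it applies to an image -- typically a Resize,
# a CenterCrop to the model's expected input size, and a Normalize with ImageNet stats.
print(preprocess)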

1 Like

Yeah, but that's not exactly it. We need to do the patching to make the data suitable for the transformer; is that done automatically?

A ViT model takes input of shape (batch_size, channels, height, width), just like any standard vision model.
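
As a quick sanity check, something like this sketch works (it assumes vit_b_16 with its default 224x224 input and downloads the weights on first use):

import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load the pretrained model and run a dummy (batch, channels, height, width) tensor through it
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    out = model(x)

print(out.shape)  # torch.Size([1, 1000]) -- ImageNet-1k logits, no manual patching needed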

1 Like

Yeah, but I remember that we need to do the patching and the positional encoding by hand, right?
I implemented the TransUNet architecture before, yet I don't know if the ViT in PyTorch is doing it somehow. While investigating the architecture I found this conv_proj, which is actually a convolution with a specific kernel size and stride that does the reshaping. Also, I saw your earlier post talking about the "stem". Can you explain the differences a bit more?

Is there positional encoding?

Thanks
:heart:

A typical ViT in PyTorch will already handle patching the image, so the minimum input size is specified with the model:

https://pytorch.org/vision/main/models/generated/torchvision.models.vit_b_16.html#torchvision.models.vit_b_16
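
Concretely, the conv_proj you found is exactly the patching step: a Conv2d whose kernel size and stride both equal the patch size, so every 16x16 patch becomes one hidden_dim-sized token. A rough sketch of the idea (not the exact torchvision code, just the shapes, using the vit_b_16 sizes):

import torch
import torch.nn as nn

image_size, patch_size, hidden_dim = 224, 16, 768

# kernel_size == stride == patch_size: one output "pixel" per non-overlapping patch
conv_proj = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, image_size, image_size)   # (N, C, H, W)
patches = conv_proj(x)                          # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)     # (1, 196, 768) = (N, seq_length, hidden_dim)

print(tokens.shape)  # 14 * 14 = 196 patch tokens of size 768

torchvision then prepends a learnable class token, so seq_length ends up being 196 + 1 = 197 before the tokens reach the Encoder.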

The stem just adds some additional conv2d layers at the front, before the ViT's transformer blocks. You might think of it as preparing the patches to be more easily "recognized" by the embedding layer.

1 Like

What is more puzzling is the positional encoding part; I can't understand how it is happening.

class Encoder(nn.Module):
    """Transformer Model Encoder for sequence to sequence translation."""

    def __init__(
        self,
        seq_length: int,
        num_layers: int,
        num_heads: int,
        hidden_dim: int,
        mlp_dim: int,
        dropout: float,
        attention_dropout: float,
        norm_layer: Callable[..., torch.nn.Module] = partial(nn.LayerNorm, eps=1e-6),
    ):
        super().__init__()
        # Note that batch_size is on the first dim because
        # we have batch_first=True in nn.MultiAttention() by default
        self.pos_embedding = nn.Parameter(torch.empty(1, seq_length, hidden_dim).normal_(std=0.02))  # from BERT
        self.dropout = nn.Dropout(dropout)
        layers: OrderedDict[str, nn.Module] = OrderedDict()
        for i in range(num_layers):
            layers[f"encoder_layer_{i}"] = EncoderBlock(
                num_heads,
                hidden_dim,
                mlp_dim,
                dropout,
                attention_dropout,
                norm_layer,
            )
        self.layers = nn.Sequential(layers)
        self.ln = norm_layer(hidden_dim)

    def forward(self, input: torch.Tensor):
        torch._assert(input.dim() == 3, f"Expected (batch_size, seq_length, hidden_dim) got {input.shape}")
        input = input + self.pos_embedding
        return self.ln(self.layers(self.dropout(input)))

I can see what you are saying; I checked the implementation: self.pos_embedding = nn.Parameter(torch.empty(1, seq_length, hidden_dim).normal_(std=0.02)). Thanks a lot, it was really interesting to have this talk! :heart:

And the patchify stem is used to enhance the transformer's performance, right?

The positional embedding is just being initialized in __init__, giving it some random values before training begins. It is used to assign position information to the transformer, such as first patch, second patch, etc., similar to how positional encoding is used in a language model.
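
If you want to see it, you can inspect the parameter directly on the loaded model (a quick check, following the attribute names in the torchvision source you quoted; sizes assume vit_b_16 with its default 224 input and patch size 16):

from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# One learned 768-dim vector per position: 196 patch tokens + 1 class token = 197
print(model.encoder.pos_embedding.shape)  # torch.Size([1, 197, 768])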

A convolutional stem is a way to improve the accuracy of a ViT model, but it requires additional training if added later.
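
For reference, a convolutional stem usually means something like the sketch below (hypothetical layer sizes, not torchvision's API): a small stack of stride-2 convs that replaces the single 16x16 patchify conv and produces the same token grid.

import torch
import torch.nn as nn

hidden_dim = 768

# Hypothetical convolutional stem: four stride-2 3x3 convs take 224x224 down to 14x14,
# then a 1x1 conv maps to hidden_dim. These weights are new, so they need training.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(512), nn.ReLU(),
    nn.Conv2d(512, hidden_dim, kernel_size=1),
)

x = torch.randn(1, 3, 224, 224)
tokens = conv_stem(x).flatten(2).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 196, 768]) -- same token shape as the patchify conv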

1 Like