Vision Transformer: reshaping the input

Hi everyone,

I am using a pre-trained vision transformer from the models available in PyTorch. The question now: if I use the pre-trained weights and apply the same transformations that were applied during training (on ImageNet, for example), does the model by its nature do the reshaping of the input data?

pre_trained_weights = ViT_L_16_Weights.IMAGENET1K_V1

transforms = pre_trained_weights.transforms()

I mean something like this.

Thanks :heart:

I'm not sure I understand the question properly, but if you are wondering whether the transformation contains a Resize, just print it.
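
For example (a quick check; the exact output depends on your torchvision version):

from torchvision.models import ViT_L_16_Weights

# The preprocessing preset bundled with the pretrained weights
weights = ViT_L_16_Weights.IMAGENET1K_V1
preprocess = weights.transforms()

# Printing the preset shows what it applies to an image -- typically a Resize,
# a CenterCrop to the model's expected input size, and a Normalize with ImageNet stats.
print(preprocess)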

1 Like

Yeah, but that's not exactly it. We need to do the patching to make the data suitable for the transformer; is that done automatically?

A ViT model takes input of shape (batch_size, channels, height, width), just like any standard vision model.
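
As a quick sanity check, something like this sketch works (it assumes vit_b_16 with its default 224x224 input and downloads the weights on first use):

import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load the pretrained model and run a dummy (batch, channels, height, width) tensor through it
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    out = model(x)

print(out.shape)  # torch.Size([1, 1000]) -- ImageNet-1k logits, no manual patching needed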

1 Like

Yeah, but I remember that we need to do the patching and the positional encoding by hand, right?
I implemented the TransUNet architecture before, yet I don't know if the ViT in PyTorch is doing it somehow. While investigating the architecture I found this conv_proj, which is actually a convolution with a specific kernel size and stride that does the reshaping. Also, I saw your earlier post talking about the "stem". Can you explain the differences a bit more?

Is there positional encoding?

Thanks
:heart:

A typical ViT in PyTorch will already handle patching the image, so the minimum input size is specified with the model:

https://pytorch.org/vision/main/models/generated/torchvision.models.vit_b_16.html#torchvision.models.vit_b_16
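
Concretely, the conv_proj you found is exactly the patching step: a Conv2d whose kernel size and stride both equal the patch size, so every 16x16 patch becomes one hidden_dim-sized token. A rough sketch of the idea (not the exact torchvision code, just the shapes, using the vit_b_16 sizes):

import torch
import torch.nn as nn

image_size, patch_size, hidden_dim = 224, 16, 768

# kernel_size == stride == patch_size: one output "pixel" per non-overlapping patch
conv_proj = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, image_size, image_size)   # (N, C, H, W)
patches = conv_proj(x)                          # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)     # (1, 196, 768) = (N, seq_length, hidden_dim)

print(tokens.shape)  # 14 * 14 = 196 patch tokens of size 768

torchvision then prepends a learnable class token, so seq_length ends up being 196 + 1 = 197 before the tokens reach the Encoder.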

The stem just adds some additional conv2d layers at the front, before the ViT's transformer blocks. You might think of it as preparing the patches to be more easily "recognized" by the embedding layer.

1 Like

What is more puzzling is the positional encoding part; I can't understand how it is happening.

class Encoder(nn.Module):
    """Transformer Model Encoder for sequence to sequence translation."""

    def __init__(
        self,
        seq_length: int,
        num_layers: int,
        num_heads: int,
        hidden_dim: int,
        mlp_dim: int,
        dropout: float,
        attention_dropout: float,
        norm_layer: Callable[..., torch.nn.Module] = partial(nn.LayerNorm, eps=1e-6),
    ):
        super().__init__()
        # Note that batch_size is on the first dim because
        # we have batch_first=True in nn.MultiAttention() by default
        self.pos_embedding = nn.Parameter(torch.empty(1, seq_length, hidden_dim).normal_(std=0.02))  # from BERT
        self.dropout = nn.Dropout(dropout)
        layers: OrderedDict[str, nn.Module] = OrderedDict()
        for i in range(num_layers):
            layers[f"encoder_layer_{i}"] = EncoderBlock(
                num_heads,
                hidden_dim,
                mlp_dim,
                dropout,
                attention_dropout,
                norm_layer,
            )
        self.layers = nn.Sequential(layers)
        self.ln = norm_layer(hidden_dim)

    def forward(self, input: torch.Tensor):
        torch._assert(input.dim() == 3, f"Expected (batch_size, seq_length, hidden_dim) got {input.shape}")
        input = input + self.pos_embedding
        return self.ln(self.layers(self.dropout(input)))

I can see what you are saying; I checked the implementation: self.pos_embedding = nn.Parameter(torch.empty(1, seq_length, hidden_dim).normal_(std=0.02)). Thanks a lot, it was really interesting to have this talk! :heart:

And the patchify stem is used to enhance the transformer's performance, right?

The positional embedding is just being initialized in __init__, giving it some random values before training begins. It is used to assign position information to the transformer, such as first patch, second patch, etc., similar to how positional encoding is used in a language model.
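
If you want to see it, you can inspect the parameter directly on the loaded model (a quick check, following the attribute names in the torchvision source you quoted; sizes assume vit_b_16 with its default 224 input and patch size 16):

from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# One learned 768-dim vector per position: 196 patch tokens + 1 class token = 197
print(model.encoder.pos_embedding.shape)  # torch.Size([1, 197, 768])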

A convolutional stem is a way to improve the accuracy of a ViT model, but it requires additional training if added later.
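
For reference, a convolutional stem usually means something like the sketch below (hypothetical layer sizes, not torchvision's API): a small stack of stride-2 convs that replaces the single 16x16 patchify conv and produces the same token grid.

import torch
import torch.nn as nn

hidden_dim = 768

# Hypothetical convolutional stem: four stride-2 3x3 convs take 224x224 down to 14x14,
# then a 1x1 conv maps to hidden_dim. These weights are new, so they need training.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(512), nn.ReLU(),
    nn.Conv2d(512, hidden_dim, kernel_size=1),
)

x = torch.randn(1, 3, 224, 224)
tokens = conv_stem(x).flatten(2).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 196, 768]) -- same token shape as the patchify conv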

1 Like