I am using a pre-trained vision transformer from PyTorch's available models. My question: if I am using the pre-trained weights and applying the same transformations that were applied during training on ImageNet, for example, does the model by its nature do the reshaping of the input data?
Yes, but I remember that we need to do the patching and the positional encoding ourselves, right?
I implemented the TransUNet architecture before, yet I don't know if the ViT in PyTorch is doing it somehow. I was investigating the architecture and I found this conv_proj, which was actually a CNN with a specific kernel size and stride to do the reshaping. Also, I saw your post talking about the "stem" thing before. Can you explain more about the differences?
The stem just adds some additional Conv2d layers in front of the ViT. You might think of it as preparing the patches to be more easily "recognized" by the embedding layer.
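To make the conv_proj part concrete: a minimal sketch of how a single Conv2d with kernel size and stride equal to the patch size patchifies an image in one shot (the sizes below assume a ViT-B/16-style setup: 16x16 patches, 768-dim embeddings, 224x224 input; they are illustrative, not pulled from your model).

```python
import torch
import torch.nn as nn

# Assumed ViT-B/16-style sizes: 16x16 patches, 768-dim hidden size
patch_size, hidden_dim = 16, 768

# kernel_size == stride == patch_size means each output position
# sees exactly one non-overlapping patch: convolution as patchify + linear projection
conv_proj = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)

img = torch.randn(1, 3, 224, 224)
x = conv_proj(img)                 # (1, 768, 14, 14): a 14x14 grid of patch embeddings
x = x.flatten(2).transpose(1, 2)   # (1, 196, 768): one 768-dim token per patch
```

So there is no separate "cut the image into patches" step to do by hand: the strided convolution does the patching and the linear embedding at the same time.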
What puzzles me more is the positional encoding part; I can't understand how it happens.
from collections import OrderedDict
from functools import partial
from typing import Callable

import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Transformer Model Encoder for sequence to sequence translation."""

    def __init__(
        self,
        seq_length: int,
        num_layers: int,
        num_heads: int,
        hidden_dim: int,
        mlp_dim: int,
        dropout: float,
        attention_dropout: float,
        norm_layer: Callable[..., torch.nn.Module] = partial(nn.LayerNorm, eps=1e-6),
    ):
        super().__init__()
        # Note that batch_size is on the first dim because
        # we have batch_first=True in nn.MultiheadAttention() by default
        self.pos_embedding = nn.Parameter(torch.empty(1, seq_length, hidden_dim).normal_(std=0.02))  # from BERT
        self.dropout = nn.Dropout(dropout)
        layers: OrderedDict[str, nn.Module] = OrderedDict()
        for i in range(num_layers):
            layers[f"encoder_layer_{i}"] = EncoderBlock(
                num_heads,
                hidden_dim,
                mlp_dim,
                dropout,
                attention_dropout,
                norm_layer,
            )
        self.layers = nn.Sequential(layers)
        self.ln = norm_layer(hidden_dim)

    def forward(self, input: torch.Tensor):
        torch._assert(input.dim() == 3, f"Expected (batch_size, seq_length, hidden_dim) got {input.shape}")
        input = input + self.pos_embedding
        return self.ln(self.layers(self.dropout(input)))
I can see what you are saying; I checked the implementation: self.pos_embedding = nn.Parameter(torch.empty(1, seq_length, hidden_dim).normal_(std=0.02)). Thanks a lot, it was very interesting to have this talk!
And the patchify stem is used to enhance the transformer's performance, right?
The positional embedding is just initialized in __init__ with random values before training begins. It is used to give the transformer some position information, such as first patch, second patch, etc., similar to how a positional encoder is used in a language model. Because it is an nn.Parameter, it is learned along with the rest of the weights during training.
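To show how that learned embedding gets applied in forward, here is a small sketch of the input = input + self.pos_embedding line in isolation. The sizes assume ViT-B/16 (196 patches + 1 class token = 197 tokens, 768-dim hidden size); the leading 1 in the embedding's shape lets it broadcast across the batch dimension.

```python
import torch
import torch.nn as nn

# Assumed ViT-B/16 sizes: 14*14 patches + class token, 768-dim hidden size
seq_length, hidden_dim = 197, 768

# Same initialization as the torchvision snippet (std=0.02, "from BERT")
pos_embedding = nn.Parameter(torch.empty(1, seq_length, hidden_dim).normal_(std=0.02))

x = torch.randn(8, seq_length, hidden_dim)  # a batch of 8 token sequences
out = x + pos_embedding                     # (1, S, D) broadcasts over the batch dim
print(out.shape)                            # torch.Size([8, 197, 768])
```

So no positional encoding has to be done by hand: it is just a learned tensor, added to every sequence in the batch before the encoder blocks run.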
A stem is a way to improve the accuracy of a ViT model, but it requires additional training if added after pre-training.
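For contrast with the single-conv patchify above, here is a hypothetical convolutional stem in the spirit of "Early Convolutions Help Transformers See Better": a short stack of strided 3x3 convolutions that ends at the same 14x14 token grid as a 16x16 patchify would. All channel widths here are illustrative choices, not from any particular model.

```python
import torch
import torch.nn as nn

# Hypothetical conv stem: four strided 3x3 convs (224 -> 112 -> 56 -> 28 -> 14),
# then a 1x1 conv up to the assumed 768-dim embedding size
stem = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
    nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.BatchNorm2d(512), nn.ReLU(),
    nn.Conv2d(512, 768, 1),  # 1x1 projection to the token dimension
)

x = stem(torch.randn(1, 3, 224, 224))  # (1, 768, 14, 14): same grid as 16x16 patchify
```

Because the output grid matches the patchify version, the rest of the ViT is unchanged; but since these conv weights are new, they (and the layers after them) need training, which is why adding a stem to an already pre-trained ViT costs extra training.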