Does the Attention layer play a role in the network's feedforward (inference) pass?

I know that attention plays a role in training, but is it also needed for testing/inference?

I found a torch.nn.Dropout inside the Attention module.

I think the Dropout layer will not be used during testing.

import torch
from torch import nn
from einops import rearrange


class Attention(nn.Module):
    def __init__(self, dim, heads=8, dim_head=64, dropout=0.):
        super().__init__()
        inner_dim = dim_head * heads
        # Skip the output projection when a single head already matches the input dim
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5

        self.attend = nn.Softmax(dim=-1)
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)

        # Dropout is only applied after the output projection
        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        # x: (batch, patches, tokens, dim)
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = map(lambda t: rearrange(t, 'b p n (h d) -> b p h n d', h=self.heads), qkv)

        # Scaled dot-product attention
        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        attn = self.attend(dots)
        out = torch.matmul(attn, v)
        out = rearrange(out, 'b p h n d -> b p n (h d)')
        return self.to_out(out)


PS: the code is from awesome_lightweight_networks/mobile_vit.py at 3917c7f919bdd5c445b07e6df617f96f1392321f · murufeng/awesome_lightweight_networks (github.com)
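
For context, here is a minimal usage sketch (my own illustration, not from the repo); the input shape follows the (batch, patches, tokens, dim) layout implied by the rearrange pattern above, and dim=96, heads=8, dim_head=64 are just example values:

# Hypothetical usage sketch, assuming dim=96, heads=8, dim_head=64
attn = Attention(dim=96, heads=8, dim_head=64, dropout=0.1)
x = torch.randn(2, 4, 16, 96)        # (batch, patches, tokens, dim)

attn.train()
y_train = attn(x)                    # dropout active: units of the output projection are randomly zeroed

attn.eval()
y_eval = attn(x)                     # dropout disabled: output is deterministic
print(y_train.shape, y_eval.shape)   # torch.Size([2, 4, 16, 96]) in both cases

The attention computation itself (to_qkv, softmax, weighted sum) runs identically in both modes; only the nn.Dropout inside to_out changes behavior between train() and eval().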

Unlike dropout, the attention layer is used at test/inference time as well.
Dropout is a regularization technique that randomly turns off neurons during training. You can read more about dropout here.
Attention, on the other hand, is used to enhance certain parts of the input data, telling the network which regions should be given extra emphasis. This blog gives a good overview of the attention mechanism. While attention and dropout may seem similar, a neural network learns the attention map during training, and unlike dropout it is not random in nature.
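
To make the distinction concrete, here is a small sketch (my own illustration): nn.Dropout only zeroes activations in train() mode and becomes a no-op in eval() mode, whereas the attention weights are computed from learned projections in exactly the same way in both modes.

import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # roughly half the entries zeroed, the rest scaled by 1/(1-p); different on every call

drop.eval()
print(drop(x))   # identical to x: dropout is an identity at inference

# The attention map, by contrast, comes from the learned to_qkv projection,
# so the softmax(Q K^T / sqrt(d)) step is computed in both train() and eval() modes.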