How to use nn.TransformerEncoder

I am trying to use nn.TransformerEncoder to intelligently downsample a (3, N) tensor into a (1, N) tensor that comprises the important parts of the (3, N) input.

As a toy experiment, I want the nn.TransformerEncoder to learn the averaging operation: (3, N) → TransformerEncoder → (3, N), and indexing the output with [0] should yield a (1, N) tensor equal to the average across dim=0. However, I am unable to get it to work: the L1 loss plateaus around 0.4, which is quite high. I would greatly appreciate any pointers for debugging.

Thank you so much!

Attached is my training code:

import torch
import torch.nn as nn
from torch.nn import TransformerEncoderLayer
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# d_model=8, nhead=1, dim_feedforward=64, dropout=0.0
encoder_layers = TransformerEncoderLayer(8, 1, 64, 0.0)

network = nn.TransformerEncoder(encoder_layers, 5).to(device)
criterion = nn.L1Loss()
optimizer = torch.optim.Adam(network.parameters(), lr=1e-5)

for i in tqdm(range(30000)):
    optimizer.zero_grad()

    src = torch.rand((3, 1, 8)).to(device) * 2 - 1.0
    # src: (3, 1, 8), i.e. (seq_len, batch, d_model)

    gt = torch.mean(src, dim=0, keepdim=True).repeat((3, 1, 1))
    # gt: (3, 1, 8), the mean over dim=0 repeated along dim=0

    out = network(src)
    # out: (3, 1, 8)

    loss = criterion(out, gt)
    loss.backward()
    optimizer.step()

    if i % 1000 == 0:
        tqdm.write(f"step {i}: L1 loss {loss.item():.4f}")
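
For context on the 0.4 figure, here is a quick way to check what L1 loss a trivial identity baseline (just outputting src unchanged) gets on the same data. This is a minimal sketch of my own; the sample count is an arbitrary choice:

# Identity baseline: predict the input itself and measure the L1 loss
# against the per-batch mean target. Reuses `criterion` and `device`
# from the training code above; 10000 is an arbitrary sample count.
with torch.no_grad():
    src = torch.rand((3, 10000, 8)).to(device) * 2 - 1.0
    gt = torch.mean(src, dim=0, keepdim=True).repeat((3, 1, 1))
    print(f"identity-baseline L1 loss: {criterion(src, gt).item():.4f}")

If the trained network's loss is not clearly below this baseline, it has effectively learned nothing beyond passing the input through.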

I’m curious about the goal of your experiment.
Both the source and the target are derived from random data; is that really what you are aiming for?

That aside, your code itself looks well-designed.

Thanks for the reply!

I am aiming to see whether the transformer encoder can learn “torch.mean”, i.e. the averaging operation. Perhaps my experiment is ill-posed; I am not sure whether, given my current setup, we should expect the transformer encoder to learn averaging.
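
As a next debugging step, a standard sanity check is to see whether the network can at least overfit a single fixed batch. A minimal sketch, reusing the network, optimizer, and criterion from my training code above (the seed and step count are arbitrary choices of mine):

# Overfit one fixed batch: if the loss cannot be driven near zero even
# here, the problem is in the optimization setup (e.g. the learning
# rate), not in the task itself.
torch.manual_seed(0)
src = torch.rand((3, 1, 8)).to(device) * 2 - 1.0
gt = torch.mean(src, dim=0, keepdim=True).repeat((3, 1, 1))

for i in range(5000):
    optimizer.zero_grad()
    loss = criterion(network(src), gt)
    loss.backward()
    optimizer.step()

print(f"final loss on the fixed batch: {loss.item():.4f}")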

I don’t think it can learn the averaging operation. But nice challenge 🙂