I am trying to use nn.TransformerEncoder to intelligently downsample a (3,N) vector into a (1, N) vector that comprises of the important parts of the (3, N).
As a toy experiment, I want the nn.TransformerEncoder to learn the averaging operation such that (3,N) → transformerencoder - > (3,N)[0] → (1,N) is the average across dim=0. However, I am unable to get it working. The L1 loss remains around 0.4 which is pretty high. I would greatly appreciate any pointers for debugging.
Thank you so much!
Attached is my training code:
encoder_layers = TransformerEncoderLayer(8, 1, 64, 0.0)
#d_model, n_head, dim_feedforward, dropout
network = nn.Sequential(nn.TransformerEncoder(encoder_layers, 5),)
criterion = nn.L1Loss()
optimizer = torch.optim.Adam(network.parameters(), lr=1e-5)
for i in tqdm(range(30000)):
optimizer.zero_grad()
src = torch.rand((3, 1, 8)).to(device)*2 - 1.0
# src (3,1,8)
gt = torch.mean(src, dim=0, keepdim=True).to(device).repeat((3,1,1))
# gt (3,1,8)
out = network(src)
# out (3, 1, 8)
loss = criterion(out, gt)
loss.backward()
optimizer.step()