I am trying to use nn.TransformerEncoder to intelligently downsample a (3, N) vector into a (1, N) vector that consists of the important parts of the (3, N) input.
As a toy experiment, I want the nn.TransformerEncoder to learn the averaging operation, such that (3, N) → TransformerEncoder → (3, N) → (1, N) yields the average across dim=0. However, I am unable to get it working: the L1 loss plateaus around 0.4, which is quite high. I would greatly appreciate any pointers for debugging.
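To make the target concrete, here is a minimal sketch (plain tensor ops, no model) of the ground truth the encoder is supposed to learn: the mean across dim=0, broadcast back to the (3, 1, 8) shape the encoder outputs. The shapes match the training code below.

```python
import torch

# Target of the toy experiment: the mean across dim=0,
# repeated so it matches the encoder's (3, 1, 8) output shape.
src = torch.rand((3, 1, 8)) * 2 - 1.0      # values in [-1, 1)
gt = src.mean(dim=0, keepdim=True)         # (1, 1, 8)
gt_repeated = gt.repeat(3, 1, 1)           # (3, 1, 8)
```

Every position along dim=0 of `gt_repeated` holds the same averaged slice, so the encoder only succeeds if all three output rows collapse to the mean.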
Thank you so much!
Attached is my training code:
```python
import torch
import torch.nn as nn
from torch.nn import TransformerEncoderLayer
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# d_model=8, n_head=1, dim_feedforward=64, dropout=0.0
encoder_layers = TransformerEncoderLayer(8, 1, 64, 0.0)
network = nn.Sequential(nn.TransformerEncoder(encoder_layers, 5)).to(device)
criterion = nn.L1Loss()
optimizer = torch.optim.Adam(network.parameters(), lr=1e-5)

for i in tqdm(range(30000)):
    optimizer.zero_grad()
    src = torch.rand((3, 1, 8)).to(device) * 2 - 1.0             # src: (3, 1, 8)
    gt = torch.mean(src, dim=0, keepdim=True).repeat((3, 1, 1))  # gt: (3, 1, 8)
    out = network(src)                                           # out: (3, 1, 8)
    loss = criterion(out, gt)
    loss.backward()
    optimizer.step()
```