I have an encoder, which outputs a tensor with shape (bn, c * k, 32, 32)
. I now want produce k means with shape (bn, k, 1, 2)
. So the means are 2-dim coordinates. To do so, I want to use k FC Layers, while for each mean k_i I only want to use c channels.
So my idea is, that I reshape the encoder output out
to a 5d tensor with shape (bn, k, c, 32, 32)
. Then I can use the flattened out[:, 0]
… out[:, k]
as input for the k linear layers.
The trivial solution would be to define the linear layers manually:
self.fc0 = nn.Linear(c * 32 * 32, 2)
...
self.fck = nn.Linear(c * 32 * 32, 2)
Then I could define the forward pass for each mean as follows:
mean_0 = self.fc0(out[:, 0].reshape(bn, -1))
...
mean_k = self.fck(out[:, k].reshape(bn, -1))
Is there a more efficient way to do that?
Edit: To give a little more information about the background: I want to find the keypoints/landmarks of the input image. The idea is, that I assign each keypoint k a number of feature maps c. That’s, why the output of the encoder has the shape (bn, k, c, 32, 32)
. Now I want those c feature maps to predict its keypoint, which is a 2-dim coordinate. Essentially, I want a separate fc layer for each keypoint.