Hello,
I have a question regarding my classifier. Let me first explain my problem.
Input
It consists of two things: text features and 3D object features.
Text features are represented as 10 x 256, where 10 is the number of phrases in a given sentence and each phrase is represented by a 256-dimensional vector.
Object features are represented as M x 128, where M is the number of objects in a given 3D scene, each represented by a 128-dimensional vector.
I then combine each text feature with the complete set of object features to create fused features of shape 10 x M x 128, where 128 is the dimension of each fused feature.
Objective
I want to learn a text-to-objects mapping, where each phrase in the 10 x M x 128 tensor can be mapped to one or more of the M objects. Essentially, I want to train M classifiers, one for each object.
Output
For the final input of shape 10 x M x 128, I expect the output to be 10 x M, i.e., for each of the 10 text phrases we get an M-dimensional vector whose values are 1 for all objects classified as belonging to that phrase and 0 otherwise.
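Since each phrase can map to more than one object, this is a multi-label classification setup. A minimal sketch of the corresponding targets and loss, assuming `binary_cross_entropy_with_logits` and a 0.5 threshold (both are assumptions, not part of the original post):

```python
import torch
import torch.nn.functional as F

num_phrases, M = 10, 6
logits = torch.randn(num_phrases, M)   # raw per-object scores from some classifier
targets = torch.zeros(num_phrases, M)  # 1 where a phrase maps to an object
targets[0, [1, 3]] = 1.0               # e.g. phrase 0 refers to objects 1 and 3

# Multi-label loss: each of the M outputs is an independent binary decision.
loss = F.binary_cross_entropy_with_logits(logits, targets)

# At inference, threshold the sigmoid to get the 0/1 output described above.
preds = (torch.sigmoid(logits) > 0.5).float()  # 10 x M
```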
Model
Now, when batch_size comes into the picture, an input with batch_size 14 will be represented as 14 x 10 x M x 128. I reshape it to 140 x M x 128, pass it through the following model, squeeze the output from 140 x 1 x M to 140 x M, and finally reshape it back to 14 x 10 x M. Does this make sense? Is there a better way to do it?
import torch.nn as nn

hidden_size = 128
M = 128

# 1x1 convolutions act as shared linear layers over the channel dimension.
classifier = nn.Sequential(
    nn.Conv1d(M, hidden_size, 1),
    nn.ReLU(),
    nn.BatchNorm1d(hidden_size),
    nn.Conv1d(hidden_size, hidden_size, 1),
    nn.ReLU(),
    nn.BatchNorm1d(hidden_size),
    nn.Conv1d(hidden_size, M, 1),
    nn.ReLU(),
    nn.BatchNorm1d(M),
    nn.Conv1d(M, 1, 1),  # collapse to a single output channel
)
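A quick shape-check of the pipeline described above (a sketch, with random inputs). One caveat worth noting: `nn.Conv1d` treats dim 1 of a `(N, C, L)` input as channels, so with a 140 x M x 128 input these convolutions mix across the M objects while the trailing 128 feature dimension is the length axis; the final axis of the output equals M only because M == 128 in this example.

```python
import torch
import torch.nn as nn

hidden_size = 128
M = 128  # note: M happens to equal the feature dimension here

classifier = nn.Sequential(
    nn.Conv1d(M, hidden_size, 1), nn.ReLU(), nn.BatchNorm1d(hidden_size),
    nn.Conv1d(hidden_size, hidden_size, 1), nn.ReLU(), nn.BatchNorm1d(hidden_size),
    nn.Conv1d(hidden_size, M, 1), nn.ReLU(), nn.BatchNorm1d(M),
    nn.Conv1d(M, 1, 1),
)

x = torch.randn(14, 10, M, 128)  # batch x phrases x objects x features
flat = x.view(-1, M, 128)        # 140 x M x 128
out = classifier(flat)           # 140 x 1 x 128: dim 1 (M) is treated as
                                 # channels, the 128 features as length
scores = out.squeeze(1).view(14, 10, -1)  # 14 x 10 x 128
```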
Thank you for your help in advance!