I have a question regarding my classifier. Let me first explain my problem.
My input consists of two things: text features and 3D object features.

Text features are represented as 10 x 256, where 10 is the number of phrases in a given sentence and each phrase is represented using a 256-dimensional vector.

Object features are represented as M x 128, where M is the number of objects in a given 3D scene, each represented using a 128-dimensional vector.

I then combine each text feature with the complete set of object features to create fused features of shape 10 x M x 128, where 128 is the dimension of each fused feature.
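For concreteness, here is a minimal sketch of how the fused features could be built; the projection from 256 to 128 dimensions and the element-wise product are placeholder assumptions, since the actual fusion step may differ:

```python
import torch
import torch.nn as nn

num_phrases, text_dim = 10, 256   # phrases per sentence, text feature size
M, obj_dim = 7, 128               # objects per scene, object feature size

text_feats = torch.randn(num_phrases, text_dim)  # 10 x 256
obj_feats = torch.randn(M, obj_dim)              # M x 128

# Hypothetical fusion: project each phrase to 128-d, then combine it with
# every object feature via broadcasting (element-wise product as a stand-in).
text_proj = nn.Linear(text_dim, obj_dim)
fused = text_proj(text_feats).unsqueeze(1) * obj_feats.unsqueeze(0)

print(fused.shape)  # torch.Size([10, 7, 128]), i.e. 10 x M x 128
```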
I want to learn a text-to-objects mapping, where each phrase in the 10 x M x 128 tensor can be mapped to one or more of the M objects. Essentially, I want to train M classifiers, one for each object in the scene.

For the final input of shape 10 x M x 128, I expect the output to be 10 x M, i.e., for each of the 10 text phrases we get an M-dimensional vector, whose values are 1 for all objects classified as belonging to that phrase and 0 otherwise.

When batch_size comes into the picture, an input with batch_size 14, say, will be represented as 14 x 10 x M x 128. If I reshape it to 140 x M x 128, pass it through the following model, squeeze the output from 140 x 1 x M to 140 x M, and finally reshape it back to 14 x 10 x M, does that make sense? Is there a better way to do this?
```python
import torch.nn as nn

hidden_size = 128
M = 128

classifier = nn.Sequential(
    nn.Conv1d(M, hidden_size, 1),
    nn.ReLU(),
    nn.BatchNorm1d(hidden_size),
    nn.Conv1d(hidden_size, hidden_size, 1),
    nn.ReLU(),
    nn.BatchNorm1d(hidden_size),
    nn.Conv1d(hidden_size, M, 1),
    nn.ReLU(),
    nn.BatchNorm1d(M),
    nn.Conv1d(M, 1, 1),
)
```
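To make the intended shape flow concrete, here is a runnable sketch of the reshape → classifier → squeeze → reshape pipeline with the same model, using random inputs and M fixed to 128 as above:

```python
import torch
import torch.nn as nn

batch_size, num_phrases, M, feat_dim = 14, 10, 128, 128
hidden_size = 128

classifier = nn.Sequential(
    nn.Conv1d(M, hidden_size, 1), nn.ReLU(), nn.BatchNorm1d(hidden_size),
    nn.Conv1d(hidden_size, hidden_size, 1), nn.ReLU(), nn.BatchNorm1d(hidden_size),
    nn.Conv1d(hidden_size, M, 1), nn.ReLU(), nn.BatchNorm1d(M),
    nn.Conv1d(M, 1, 1),
)

fused = torch.randn(batch_size, num_phrases, M, feat_dim)  # 14 x 10 x M x 128
x = fused.reshape(batch_size * num_phrases, M, feat_dim)   # 140 x M x 128

out = classifier(x)                                        # 140 x 1 x 128
out = out.squeeze(1)                                       # 140 x 128
out = out.reshape(batch_size, num_phrases, -1)             # 14 x 10 x 128

print(out.shape)  # torch.Size([14, 10, 128])
```

One thing worth noting: with this layout, Conv1d treats the M objects as input channels and the 128 feature dimensions as the sequence length, so the output's last axis is actually the feature length, which coincides with M here only because M = 128.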
Thank you for your help in advance!