Multi-label classification


I have a question regarding my classifier. Let me first explain my problem.

My input consists of two things: text features and 3D object features.

Text features have shape 10 x 256, where 10 is the number of phrases in a given sentence and each phrase is represented by a 256-dimensional vector.

Object features have shape M x 128, where M is the number of objects in a given 3D scene and each object is represented by a 128-dimensional vector.

I then combine each phrase's text feature with the full set of object features to create fused features of shape 10 x M x 128, where 128 is the dimension of each fused feature.
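
To make the shapes concrete, here is a minimal sketch of one possible fusion. The fusion method itself is an assumption for illustration only (a hypothetical linear projection `text_proj` of the text to 128 dimensions, followed by a broadcast add), and `M = 5` is an arbitrary example value:

```python
import torch
import torch.nn as nn

M = 5  # hypothetical number of objects in the scene

text = torch.randn(10, 256)     # 10 phrases, 256-d each
objects = torch.randn(M, 128)   # M objects, 128-d each

# Assumed fusion (illustrative only): project each phrase to 128-d,
# then broadcast-add it to every object feature.
text_proj = nn.Linear(256, 128)
fused = text_proj(text).unsqueeze(1) + objects.unsqueeze(0)  # (10, 1, 128) + (1, M, 128)

print(fused.shape)  # torch.Size([10, 5, 128])
```

Any other fusion that yields one 128-dimensional vector per (phrase, object) pair would give the same 10 x M x 128 shape.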

I want to learn a text-to-objects mapping, where each of the 10 phrases in the 10 x M x 128 tensor can be mapped to one or more of the M objects. Essentially, I want to train M classifiers, one for each object.

For the final input of shape 10 x M x 128, I expect the output to be 10 x M, i.e., for each of the 10 text phrases, we get an M-dimensional vector. Its entries will be 1 for all objects classified as belonging to that phrase and 0 otherwise.
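
As a sketch of what this 10 x M target looks like in practice, the snippet below builds a multi-hot target and thresholds per-object scores. The random `logits` stand in for a model's output, `M = 5` and the chosen object indices are made-up example values, and using `nn.BCEWithLogitsLoss` with a 0.5 threshold is one common choice for multi-label problems rather than the only option:

```python
import torch
import torch.nn as nn

M = 5  # hypothetical number of objects

# Placeholder scores: one logit per (phrase, object) pair.
logits = torch.randn(10, M)

# Multi-hot targets: 1 where an object belongs to a phrase, 0 otherwise.
targets = torch.zeros(10, M)
targets[0, [1, 3]] = 1.0  # example: phrase 0 refers to objects 1 and 3

# Each of the M outputs is treated as an independent binary classifier.
loss = nn.BCEWithLogitsLoss()(logits, targets)

# At inference time, threshold the sigmoid to get the 0/1 matrix.
pred = (torch.sigmoid(logits) > 0.5).float()
print(pred.shape)  # torch.Size([10, 5])
```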

Now when batch_size comes into the picture: an input with batch_size 14 will have shape 14 x 10 x M x 128. Suppose I reshape it to 140 x M x 128, pass it through the following model, squeeze the output from 140 x 1 x M to 140 x M, and finally reshape it to 14 x 10 x M. Does this make sense? Is there a better way to do this?

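import torch.nn as nn

hidden_size = 128
M = 128  # number of objects (example value)

# Stack of kernel-size-1 convolutions; Conv1d treats dim 1 of its input as channels
classifier = nn.Sequential(
    nn.Conv1d(M, hidden_size, 1),
    nn.Conv1d(hidden_size, hidden_size, 1),
    nn.Conv1d(hidden_size, M, 1),
    nn.Conv1d(M, 1, 1),
)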
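
A quick shape check of the pipeline described above, as a sketch using random input and the same M = 128 and hidden_size = 128 values. Since `nn.Conv1d` expects input of shape (batch, channels, length), the M axis is read as channels here and the feature axis as length:

```python
import torch
import torch.nn as nn

hidden_size = 128
M = 128

classifier = nn.Sequential(
    nn.Conv1d(M, hidden_size, 1),
    nn.Conv1d(hidden_size, hidden_size, 1),
    nn.Conv1d(hidden_size, M, 1),
    nn.Conv1d(M, 1, 1),
)

x = torch.randn(14, 10, M, 128)      # (batch, phrases, objects, features)
flat = x.reshape(14 * 10, M, 128)    # (140, M, 128)
out = classifier(flat)               # (140, 1, 128): channels M -> 1, length 128 kept
out = out.squeeze(1).reshape(14, 10, -1)

print(out.shape)  # torch.Size([14, 10, 128])
```

The trailing dimension of the output is the feature length (128), which equals M in this check only because M is also set to 128.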

Thank you for your help in advance!