My task is multimodal emotion recognition. How to fuse the different modalities is the challenging part.
The audio and text modality tensors each have shape (sequence length, batch size, feature dimension).
Question: I want bimodality = a * audio_modality + b * text_modality.
How can I use PyTorch to make a and b custom learnable parameters? Thanks.
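A minimal sketch of one way to do this: register the scalars as `nn.Parameter` inside a small module, so any optimizer that receives the module's parameters will update them during training. The initial values of 1.0 and the tensor shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fuse two modality tensors with learnable scalar weights a and b."""
    def __init__(self):
        super().__init__()
        # Wrapping in nn.Parameter registers a and b with the module,
        # so they show up in .parameters() and receive gradients.
        self.a = nn.Parameter(torch.tensor(1.0))
        self.b = nn.Parameter(torch.tensor(1.0))

    def forward(self, audio, text):
        # audio, text: (seq_len, batch, feature_dim), same shape
        return self.a * audio + self.b * text

fusion = WeightedFusion()
audio = torch.randn(10, 4, 64)   # (seq_len, batch, feature_dim)
text = torch.randn(10, 4, 64)
out = fusion(audio, text)
print(out.shape)  # torch.Size([10, 4, 64])
```

Plain `torch.tensor(1.0, requires_grad=True)` would also get gradients, but it would not be collected by `model.parameters()`, which is why `nn.Parameter` is the idiomatic choice here.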
And change the final activation to ReLU, and change the final Linear layer to 64 output neurons.
Combine both of the above into one model with a single forward pass.
Then concatenate both outputs and pass them through a final Linear layer with 128 input neurons and as many output neurons as you have classes, followed by a Softmax.
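The steps above could be sketched as follows. The input dimensions (`audio_dim`, `text_dim`) and class count are placeholder assumptions; each branch here is a single Linear + ReLU, which you would replace with your actual per-modality encoders.

```python
import torch
import torch.nn as nn

class BimodalClassifier(nn.Module):
    # audio_dim, text_dim, num_classes are hypothetical values
    def __init__(self, audio_dim=40, text_dim=300, num_classes=6):
        super().__init__()
        # Each branch ends in a 64-wide Linear followed by ReLU
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, 64), nn.ReLU())
        self.text_branch = nn.Sequential(nn.Linear(text_dim, 64), nn.ReLU())
        # 64 + 64 = 128 inputs into the final classification layer
        self.classifier = nn.Sequential(
            nn.Linear(128, num_classes),
            nn.Softmax(dim=-1),
        )

    def forward(self, audio, text):
        a = self.audio_branch(audio)        # (batch, 64)
        t = self.text_branch(text)          # (batch, 64)
        fused = torch.cat([a, t], dim=-1)   # (batch, 128)
        return self.classifier(fused)

model = BimodalClassifier()
probs = model(torch.randn(8, 40), torch.randn(8, 300))
print(probs.shape)  # torch.Size([8, 6])
```

One caveat: if you train with `nn.CrossEntropyLoss`, drop the Softmax and feed it the raw logits, since that loss applies log-softmax internally; keep the Softmax only when you need probabilities at inference time.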