The size of tensor a (146) must match the size of tensor b (1214) at non-singleton dimension 1

Hello there,

I am currently trying to create a multimodal emotion recognition model using BERT and the Audio Spectrogram Transformer, but I ran into some issues when trying to train on the data.

The error is as follows:

11 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/ in _wrapped_call_impl(self, *args, **kwargs)
   1516             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517         else:
-> 1518             return self._call_impl(*args, **kwargs)
   1520     def _call_impl(self, *args, **kwargs):

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/ in _call_impl(self, *args, **kwargs)
   1525                 or _global_backward_pre_hooks or _global_backward_hooks
   1526                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527             return forward_call(*args, **kwargs)
   1529         try:

<ipython-input-23-1eec2900cfb9> in forward(self, text_input, audio_input)
      8     def forward(self, text_input, audio_input):
      9         text_output = self.text_model(**text_input).hidden_states[-1][:, 0, :]
---> 10         audio_output = self.audio_model(audio_input).last_hidden_state
     11         concatenated = torch.cat((text_output, audio_output), dim=-1)
     12         logits = self.classifier(concatenated)

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/ in _wrapped_call_impl(self, *args, **kwargs)
   1516             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517         else:
-> 1518             return self._call_impl(*args, **kwargs)
   1520     def _call_impl(self, *args, **kwargs):

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/ in _call_impl(self, *args, **kwargs)
   1525                 or _global_backward_pre_hooks or _global_backward_hooks
   1526                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527             return forward_call(*args, **kwargs)
   1529         try:

/usr/local/lib/python3.10/dist-packages/transformers/models/audio_spectrogram_transformer/ in forward(self, input_values, head_mask, labels, output_attentions, output_hidden_states, return_dict)
    571         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
--> 573         outputs = self.audio_spectrogram_transformer(
    574             input_values,
    575             head_mask=head_mask,

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/ in _wrapped_call_impl(self, *args, **kwargs)
   1516             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517         else:
-> 1518             return self._call_impl(*args, **kwargs)
   1520     def _call_impl(self, *args, **kwargs):

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/ in _call_impl(self, *args, **kwargs)
   1525                 or _global_backward_pre_hooks or _global_backward_hooks
   1526                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527             return forward_call(*args, **kwargs)
   1529         try:

/usr/local/lib/python3.10/dist-packages/transformers/models/audio_spectrogram_transformer/ in forward(self, input_values, head_mask, output_attentions, output_hidden_states, return_dict)
    488         head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
--> 490         embedding_output = self.embeddings(input_values)
    492         encoder_outputs = self.encoder(

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/ in _wrapped_call_impl(self, *args, **kwargs)
   1516             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517         else:
-> 1518             return self._call_impl(*args, **kwargs)
   1520     def _call_impl(self, *args, **kwargs):

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/ in _call_impl(self, *args, **kwargs)
   1525                 or _global_backward_pre_hooks or _global_backward_hooks
   1526                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527             return forward_call(*args, **kwargs)
   1529         try:

/usr/local/lib/python3.10/dist-packages/transformers/models/audio_spectrogram_transformer/ in forward(self, input_values)
     85         distillation_tokens = self.distillation_token.expand(batch_size, -1, -1)
     86         embeddings = torch.cat((cls_tokens, distillation_tokens, embeddings), dim=1)
---> 87         embeddings = embeddings + self.position_embeddings
     88         embeddings = self.dropout(embeddings)

RuntimeError: The size of tensor a (146) must match the size of tensor b (1214) at non-singleton dimension 1

Here it says that there is a tensor mismatch when trying to run

logits = multimodal_model(text_input, audio_input)

my multimodal_model is as follows

multimodal_model = MultimodalModel(text_model, ast_model, num_classes)

and the MultimodalModel is as follows

class MultimodalModel(nn.Module):
    def __init__(self, text_model, audio_model, num_classes):
        super(MultimodalModel, self).__init__()
        self.text_model = text_model
        self.audio_model = audio_model
        self.classifier = nn.Linear(text_model.config.hidden_size + audio_model.config.hidden_size, num_classes)

    def forward(self, text_input, audio_input):
        text_output = self.text_model(**text_input).hidden_states[-1][:, 0, :]
        audio_output = self.audio_model(audio_input).last_hidden_state
        concatenated = torch.cat((text_output, audio_output), dim=-1)
        logits = self.classifier(concatenated)
        return logits

From the traceback it can be seen that the error appears when trying to run

audio_output = self.audio_model(audio_input).last_hidden_state

which makes me believe that the audio model is rejecting the input, since the traceback shows the mismatch occurring in

embeddings = embeddings + self.position_embeddings

For reference, this is my audio data shape:
Train: torch.Size([9887, 128, 128])
Test: torch.Size([1094, 128, 128])

and my text data comes from this code

train_text_encoded = text_tokenizer.batch_encode_plus(train_data, truncation=True, padding=True, max_length=100, return_tensors="pt")
test_text_encoded = text_tokenizer.batch_encode_plus(test_data, truncation=True, padding=True, max_length=100, return_tensors="pt")

From these snippets, what is the fault in my data or code that could have caused this error?
If any more code is needed to help debug this, I would be glad to share it.

Thank you

Check the shapes of both tensors here:

embeddings = embeddings + self.position_embeddings

as this addition seems to trigger the error.

How do I check the shape of the tensors? From what I can see, that line of code comes from the Audio Spectrogram Transformer model that I called. Going further up the traceback, the line on my end that seems to cause the error is

audio_output = self.audio_model(audio_input).last_hidden_state

I’ve tried to check the docs but I still can’t find an answer. My audio input should be
Train: torch.Size([9887, 128, 128])
where the first 128 is the specified n_mel and the second 128 is the target number of frames I set.
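
The two numbers in the error message can be reproduced with a little arithmetic. The sketch below assumes the checkpoint uses 16x16 patches with a stride of 10 in both directions and was trained for 1024 frames with 128 mel bins; these values are assumptions here and should be confirmed against ast_model.config (patch_size, time_stride, frequency_stride, max_length, num_mel_bins). The helper name ast_sequence_length is made up for illustration.

```python
def ast_sequence_length(num_frames, num_mel_bins, patch_size=16,
                        time_stride=10, frequency_stride=10):
    """Token count after patch embedding, plus the two special tokens.

    The default patch size and strides are assumed values; check them
    against the config of the checkpoint that is actually loaded.
    """
    freq_patches = (num_mel_bins - patch_size) // frequency_stride + 1
    time_patches = (num_frames - patch_size) // time_stride + 1
    # +2 for the CLS token and the distillation token that are prepended
    return freq_patches * time_patches + 2


# Input as described in the thread: 128 mel bins, 128 frames
print(ast_sequence_length(num_frames=128, num_mel_bins=128))   # 146

# Input length the position embeddings were presumably built for
print(ast_sequence_length(num_frames=1024, num_mel_bins=128))  # 1214
```

Under these assumptions, tensor a (146) is the embedded 128x128 input and tensor b (1214) is the fixed position embedding table, so the spectrograms would likely need to be padded or produced at the frame count the checkpoint expects.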

Add debug print statements to the Audio Spectrogram Model to check the shapes of the mentioned tensors. I don’t know which exact model you are using and where its source code is defined.

Ahhh I see, but is it possible to add a debugging statement to a model that is not in my own code but comes from the transformers library?

Currently this is the model I am using:

ast_model_name = "MIT/ast-finetuned-audioset-10-10-0.4593"
ast_model = AutoModelForAudioClassification.from_pretrained(ast_model_name)

The AutoModelForAudioClassification class comes from the transformers library.

Yes, you should be able to see the file locations by printing transformers.__path__ (assuming the model comes from transformers) and could then directly manipulate the Python files. I’m not in front of my workstation now, but in case you get stuck, post a minimal and executable code snippet and I can also take a look at the issue.
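
As a possible alternative to editing the installed files, a forward hook can report shapes from inside a library model. The sketch below uses a hypothetical stand-in module, TinyEmbeddings, rather than the real AST code, so the layer names and sizes are assumptions for illustration only. For the real model, the submodule to hook would have to be found first, for example by listing ast_model.named_modules().

```python
import torch
from torch import nn


class TinyEmbeddings(nn.Module):
    """Hypothetical stand-in for a library embeddings module (not the real AST code)."""

    def __init__(self, hidden=8, num_positions=1214):
        super().__init__()
        self.patch_embeddings = nn.Conv2d(1, hidden, kernel_size=16, stride=10)
        self.position_embeddings = nn.Parameter(torch.zeros(1, num_positions, hidden))

    def forward(self, x):
        x = x.unsqueeze(1)  # (batch, 1, height, width)
        patches = self.patch_embeddings(x).flatten(2).transpose(1, 2)
        # Fails when the number of patches differs from num_positions
        return patches + self.position_embeddings


shapes = {}


def record_output_shape(module, args, output):
    # Called right after the hooked submodule's forward pass
    shapes["patch_output"] = tuple(output.shape)


emb = TinyEmbeddings()
handle = emb.patch_embeddings.register_forward_hook(record_output_shape)

failed = False
try:
    emb(torch.zeros(2, 128, 128))
except RuntimeError as err:
    failed = True
    print("forward failed:", err)
finally:
    handle.remove()  # always detach the hook when done

print("patch output shape:", shapes["patch_output"])
print("position embeddings shape:", tuple(emb.position_embeddings.shape))
```

Because the hook fires before the failing addition, the recorded shape is still available after the exception, and it can be compared directly with the shape of the position embeddings.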

Once I get this working I will post an update. Thank you for the help, I’ll try your recommendation when I get back to work on this.