Tensor Broadcasting problem

Hi, I’m trying to extract features from the InternVideo2 model and I’m stuck on this error. I hope I can get some help:

RuntimeError: The size of tensor a (9) must match the size of tensor b (96) at non-singleton dimension 1

These are my inputs:

input = {}
input['input_ids'] = input_ids
input['attention_mask'] = attention_mask
input['labels'] = labels
input['video'] = video_tensor

I suspect that line 104 on this page causes the error:

text_embeds[video_idx == 1] = text_embeds[video_idx == 1] * 0 + prompt_video_embeds.to(text_embeds.device).to(text_embeds.dtype)
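
To narrow this down, it helps to know how many rows the boolean mask on that line selects. Below is a minimal sketch in plain Python, with nested lists standing in for tensors. The `video_idx` values, the hidden size of 8, and the helper name `masked_rows_shape` are made up for illustration and are not taken from InternVideo2. It relies only on the general indexing rule that `x[mask]`, for `x` of shape (batch, seq_len, hidden) and a boolean `mask` of shape (batch, seq_len), has one row per True entry.

```python
def masked_rows_shape(mask, hidden_size):
    """Shape of x[mask] for x of shape (batch, seq_len, hidden_size)
    and a boolean mask of shape (batch, seq_len)."""
    # One output row per True entry in the mask.
    selected = sum(1 for row in mask for flag in row if flag)
    return (selected, hidden_size)


# Hypothetical video_idx: one sequence of 12 tokens, 9 marked as video slots.
video_idx = [[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
mask = [[value == 1 for value in row] for row in video_idx]

shape = masked_rows_shape(mask, hidden_size=8)
print(shape)
```

If the real `video_idx` marks 9 positions while `prompt_video_embeds` carries 96 video tokens, that would be one possible source of the 9 vs 96 in the error, but I have not confirmed this against the actual shapes.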

And this is the exact code that produces the text embeddings:

text_embeds = self.lm.get_input_embeddings()(input_ids.long()).detach()
self.word_embeddings = nn.Embedding(
    config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id
)
...

def get_input_embeddings(self):
    return self.embeddings.word_embeddings


How can I get this code to run? My understanding is that the two tensors (text_embeds, prompt_video_embeds) must match in size, or at least be broadcastable, before they can be added, but I don’t know how to make them match.
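
To make sure I have the rule right, here is a minimal sketch of broadcasting in plain Python (no PyTorch needed): shapes are right-aligned, and each pair of sizes must be equal or contain a 1. The helper `broadcast_shape` is my own, and the shapes (9, 8) and (1, 96, 8) are hypothetical, chosen only because they reproduce the numbers in my error message. I don’t know the real shapes yet.

```python
def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError."""
    ndim = max(len(a), len(b))
    # Left-pad the shorter shape with 1s so both have the same rank.
    a = (1,) * (ndim - len(a)) + tuple(a)
    b = (1,) * (ndim - len(b)) + tuple(b)
    out = []
    for dim, (x, y) in enumerate(zip(a, b)):
        if x == y or x == 1 or y == 1:
            # A size of 1 stretches to match the other size.
            out.append(y if x == 1 else x)
        else:
            raise ValueError(
                f"The size of tensor a ({x}) must match the size of "
                f"tensor b ({y}) at non-singleton dimension {dim}"
            )
    return tuple(out)


HIDDEN = 8  # hypothetical hidden size, kept small for readability

# 9 masked text rows vs. 96 video tokens: incompatible.
try:
    broadcast_shape((9, HIDDEN), (1, 96, HIDDEN))
    message = None
except ValueError as err:
    message = str(err)
print(message)

# 96 masked text rows vs. 96 video tokens: compatible.
compatible = broadcast_shape((96, HIDDEN), (1, 96, HIDDEN))
print(compatible)
```

Under these assumed shapes, the addition only works when the number of rows selected by the mask equals the number of video tokens, so printing the real shapes of text_embeds[video_idx == 1] and prompt_video_embeds seems like the first thing to check.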

By the way, I also don’t understand why the variable is called text_embeds when it is initialized with the text embedding values and some of them are then replaced with video embeddings.