Hi, I’m trying to extract features from the InternVideo2 model and I’m stuck on the error below. I’d appreciate any help.
RuntimeError: The size of tensor a (9) must match the size of tensor b (96) at non-singleton dimension 1
These are my inputs:
input = {}
input['input_ids'] = input_ids
input['attention_mask'] = attention_mask
input['labels'] = labels
input['video'] = video_tensor
I suspect that line 104 of this page causes the error:
text_embeds[video_idx == 1] = text_embeds[video_idx == 1] * 0 + prompt_video_embeds.to(text_embeds.device).to(text_embeds.dtype)
And this is the code that produces the text embeddings:
text_embeds = self.lm.get_input_embeddings()(input_ids.long()).detach()
self.word_embeddings = nn.Embedding(
    config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id
)
...
def get_input_embeddings(self):
    return self.embeddings.word_embeddings
How can I get this code to run? My understanding is that the two tensors (text_embeds[video_idx == 1] and prompt_video_embeds) must have matching sizes so that they can be added, but I don’t know how to make them match.
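To make the mismatch concrete, here is a minimal sketch with toy tensors. It assumes that 9 is the number of positions where video_idx == 1 and 96 is the number of video tokens in prompt_video_embeds; I have not confirmed that this is what happens inside the model. The helper name inject_video_embeds and all sizes are hypothetical and only for illustration.

```python
import torch


def inject_video_embeds(text_embeds, video_idx, prompt_video_embeds):
    """Copy video embeddings into the positions where video_idx == 1.

    Hypothetical helper: it mirrors the masked assignment from the post,
    but checks the counts first so the failure is easier to read.
    """
    mask = video_idx == 1  # boolean mask of shape (batch, seq_len)
    # Flatten (batch, n_tokens, hidden) to (batch * n_tokens, hidden).
    flat_video = prompt_video_embeds.reshape(-1, prompt_video_embeds.shape[-1])
    n_slots, n_video = int(mask.sum()), flat_video.shape[0]
    if n_slots != n_video:
        raise ValueError(
            f"video_idx marks {n_slots} positions "
            f"but there are {n_video} video tokens"
        )
    out = text_embeds.clone()
    out[mask] = flat_video.to(out.device).to(out.dtype)
    return out


# Toy sizes (assumed, not taken from the real model config).
hidden, seq_len, num_video_tokens = 4, 120, 96
text_embeds = torch.zeros(1, seq_len, hidden)
prompt_video_embeds = torch.randn(1, num_video_tokens, hidden)

bad_idx = torch.zeros(1, seq_len, dtype=torch.long)
bad_idx[0, 5:14] = 1  # only 9 positions marked

good_idx = torch.zeros(1, seq_len, dtype=torch.long)
good_idx[0, 5:5 + num_video_tokens] = 1  # 96 positions marked

try:
    inject_video_embeds(text_embeds, bad_idx, prompt_video_embeds)
except ValueError as err:
    print(err)

merged = inject_video_embeds(text_embeds, good_idx, prompt_video_embeds)
print(merged.shape)
```

If this reading is right, the assignment can only succeed when the number of positions marked in video_idx equals the number of video tokens, so the thing to check would be how input_ids and video_idx are built rather than the embedding layer itself.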
By the way, I also don’t understand why the tensor is called text_embeds when the selected values are zeroed out and replaced with the video embeddings.