IG for VQA using VLMs from transformers

Hi there,

I would like to use Integrated Gradients (IG) over both the image and the text inputs to visualize attributions for VLMs such as LLaVA-OneVision or Phi-3.5-vision-instruct on a VQA task with multiple images.
I have already created a Google Colab notebook for a simple model from Transformers. However, for the LLaVA model I cannot pass pixel_values and inputs_embeds to the forward call together; a minimal sketch of the kind of call that fails for me is below. Does anyone have an idea how to get around this?
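
Roughly the pattern I am trying (simplified, not copied verbatim from the notebook; the checkpoint name, prompt, and image path are only illustrative):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Illustrative checkpoint; in the notebook I use a LLaVA-style model from the hub.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# For IG I want to attribute w.r.t. the text embeddings, so I look them up myself ...
inputs_embeds = model.get_input_embeddings()(inputs["input_ids"])

# ... but forwarding inputs_embeds together with pixel_values is exactly
# the combination that does not work for me:
out = model(
    inputs_embeds=inputs_embeds,
    pixel_values=inputs["pixel_values"],
    attention_mask=inputs["attention_mask"],
)
```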
Also, the attributions I get for the dandelin/vilt-b32-finetuned-vqa model don’t look quite right to me; a condensed version of that setup follows, in case someone can spot what is off.
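
This is roughly what the ViLT notebook does (simplified; in particular the embedding-layer path and the choice of attribution target are the parts I am unsure about):

```python
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering
from captum.attr import LayerIntegratedGradients

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model.eval()

image = Image.open("example.jpg")          # illustrative image path
question = "How many cats are there?"
enc = processor(image, question, return_tensors="pt")

def forward_func(input_ids, pixel_values, attention_mask):
    # classification logits over the VQA answer vocabulary
    return model(input_ids=input_ids,
                 pixel_values=pixel_values,
                 attention_mask=attention_mask).logits

# attribute the class the model actually predicts
target = model(**enc).logits.argmax(-1).item()

# attribute the text tokens w.r.t. the word-embedding layer
# (this layer path is what I used; maybe it is already the problem?)
lig = LayerIntegratedGradients(
    forward_func, model.vilt.embeddings.text_embeddings.word_embeddings
)
attrs = lig.attribute(
    inputs=enc["input_ids"],
    additional_forward_args=(enc["pixel_values"], enc["attention_mask"]),
    target=target,
    n_steps=20,
)
```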

I have created another notebook for LLaVA-OneVision. Does anyone have an idea how to feed the text embeddings (inputs_embeds) together with the pixel values into IG? The direction I was considering is sketched below.
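
One idea I had (continuing from the LLaVA snippet above, so model, processor, and inputs are reused): instead of computing inputs_embeds myself, let the model do the embedding lookup and the image-feature merge internally, and attribute w.r.t. two internal layers with Captum's LayerIntegratedGradients, which (if I read the docs correctly) also accepts a list of layers. The layer names below are from LlavaForConditionalGeneration and may differ for OneVision, so please treat this as a sketch, not something I have validated:

```python
import torch
from captum.attr import LayerIntegratedGradients

def forward_func(input_ids, pixel_values, attention_mask):
    out = model(
        input_ids=input_ids,
        pixel_values=pixel_values,
        attention_mask=attention_mask,
    )
    # scores for the next token (illustrative attribution target)
    return out.logits[:, -1, :]

# token the model would actually generate next, used as the target class
with torch.no_grad():
    predicted_token_id = model(**inputs).logits[:, -1, :].argmax(-1).item()

lig = LayerIntegratedGradients(
    forward_func,
    [
        model.get_input_embeddings(),   # -> text-token attributions
        model.multi_modal_projector,    # -> image-feature attributions (name may differ)
    ],
)

text_attr, image_attr = lig.attribute(
    inputs=inputs["input_ids"],
    additional_forward_args=(inputs["pixel_values"], inputs["attention_mask"]),
    target=predicted_token_id,
    n_steps=2,  # two steps is all that fits on my T4
)
```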
Also, could someone with a bigger GPU (I only had a T4 with 15 GB) check whether the results hold up with more than two IG steps?
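
Concretely, something like the following would already help (the values are just a guess; internal_batch_size splits the interpolation steps into smaller forward passes, which should keep peak memory manageable):

```python
text_attr, image_attr = lig.attribute(
    inputs=inputs["input_ids"],
    additional_forward_args=(inputs["pixel_values"], inputs["attention_mask"]),
    target=predicted_token_id,
    n_steps=50,              # rather than the 2 steps that fit on a T4
    internal_batch_size=4,   # trade runtime for memory
)
```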