LLaVA Steering: Why does grounding fix hallucinations in captioning but not in Yes/No QA?

Hi everyone,

I am working on an inference-time steering method for LLaVA-1.5-7B to improve visual grounding. The method monitors the attention layers during generation: if the model's attention mass on the image features drops below a threshold (i.e., if it starts ignoring the image tokens), the mechanism intervenes and boosts the attention scores back onto the visual tokens.
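Roughly, the per-head intervention looks like this (a toy numpy sketch, not my actual implementation; the threshold, boost factor, and tensor layout are illustrative placeholders):

```python
import numpy as np

def steer_attention(attn, image_slice, threshold=0.2, boost=2.0):
    """Boost per-head attention mass on image tokens when it falls below a
    threshold (toy version of the steering described above).

    attn:        (num_heads, seq_len) attention weights for the current query
                 position; each row sums to 1.
    image_slice: positions of the image tokens in the sequence.
    threshold, boost: illustrative hyperparameters, not the real ones.
    """
    attn = attn.copy()
    img_mass = attn[:, image_slice].sum(axis=-1)        # per-head mass on image tokens
    low = img_mass < threshold                          # heads currently ignoring the image
    attn[low, image_slice] *= boost                     # upweight the visual tokens
    attn[low] /= attn[low].sum(axis=-1, keepdims=True)  # renormalize each steered head
    return attn
```

In the real model this runs inside a forward hook on the attention modules, but the renormalization logic is the same.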

I have verified that the intervention is active and working mechanically. However, I am observing a stark contrast in how this affects downstream performance on two standard hallucination benchmarks:

1. Generative Captioning (CHAIR Benchmark): Success
In free-form captioning tasks (“Describe this image…”), the method works exactly as intended. It prevents the model from “drifting” away from the image. For example, if the model is about to hallucinate an object from language priors alone (e.g., a mentioned “table” making “chairs” likely), the steering forces it to look back at the image features, effectively correcting the hallucination.

2. Binary QA (POPE Benchmark): Failure
In Yes/No probing tasks (“Is there a dog?”), the exact same mechanism fails to correct the model’s bias.

  • The Scenario: I feed the model an image without a dog and ask “Is there a dog?”.

  • The Behavior: The model initially prepares to answer “Yes” based on its internal instruction-following bias.

  • The Intervention: My method detects low attention to the image and forces the model to attend strongly to the image features.

  • The Result: Presumably because the image contains no dog, the attention lands on background regions, and the model still answers “Yes.”


My Hypothesis & Questions for the Community:

I suspect this is due to a fundamental difference in how VLMs handle “Grounding” (looking) vs “Answering” (deciding). I would appreciate any insights on the following:

A. The “Prefill Gap” (Timing)
In a short QA task like “Is there a dog?”, does LLaVA form its decision logic entirely during the prefill (prompt processing) stage?

  • My method currently only steers the decoding steps (token generation).

  • Question: Has anyone successfully changed a VLM’s answer by steering only the generation phase, or is the answer “baked in” to the KV cache of the prompt? Do I need to intervene on the prompt tokens themselves?
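One way I've been framing this question: with HF transformers you can read the binary answer straight off the prefill logits (`model(**inputs).logits[0, -1]` from a single forward pass over the prompt) before any decode-time steering has had a chance to act. A small helper (the token ids below are placeholders, not real LLaVA vocab ids):

```python
import numpy as np

def answer_from_prefill(last_logits, yes_ids, no_ids):
    """Report the binary answer already implied by the prefill pass.

    last_logits: 1-D array of vocabulary logits at the final prompt token,
        i.e. what the model samples from before any decode-step steering runs.
    yes_ids / no_ids: token ids for the answer variants ("Yes", " Yes", ...),
        which are tokenizer-specific (hypothetical ids in the test below).
    Returns (answer, margin); a large margin here suggests the decision is
    already baked into the prompt's KV cache.
    """
    yes = max(float(last_logits[i]) for i in yes_ids)
    no = max(float(last_logits[i]) for i in no_ids)
    return ("Yes" if yes > no else "No"), yes - no
```

If this margin is already large and in the wrong direction before the first steered decode step, that would support the “prefill gap” hypothesis.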

B. The “Signal of Absence”
When I force the model to attend to the image in the absence of the object, the attention mass shifts to the background scenery.

  • Question: Does the VLM interpret “high attention anywhere in the image” as evidence of presence? If so, simply “looking at the image” is statistically indistinguishable from “finding the object” for the attention head. How do standard steering methods handle “absence” queries?
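A diagnostic I've been considering (sketch, assuming attention weights are already extracted): total mass on the image cannot separate “found the object” from “staring at background,” but the shape of the distribution might. Normalized entropy near 1.0 means diffuse, background-like attention; near 0.0 means a sharp peak on a few patches:

```python
import numpy as np

def image_attention_entropy(img_attn):
    """Normalized entropy of attention over image tokens only.

    img_attn: 1-D attention weights restricted to the image-token positions
              (renormalized here, so it need not sum to 1).
    Returns a value in [0, 1]: ~1.0 = uniform/diffuse, ~0.0 = a single peak.
    """
    p = np.asarray(img_attn, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum() / np.log(len(p)))
```

If my forced-attention states on “absent” queries all show high entropy, that would be evidence the model (and my steering) has no usable “signal of absence,” only a “signal of looking.”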

C. Layer Specificity
I am currently steering specific visual attention heads in the middle-to-late layers (12–28).

  • Question: In LLaVA, are the “Grounding” heads (responsible for finding objects) distinct from the “Answering” heads (responsible for outputting Yes/No)? Is it possible to steer visual processing without propagating that signal to the decision-making layers?
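One experiment I'm planning to answer this myself is a per-layer sweep: enable the intervention at one layer at a time and record which layers alone can flip the POPE answer. A sketch of the harness logic (`run_with_steering` is a hypothetical callable wrapping my hook setup, not existing code):

```python
def localize_decision_layers(run_with_steering, layers, baseline_answer):
    """Sweep the intervention one layer at a time to separate 'grounding'
    layers from 'answering' layers.

    run_with_steering: callable(layer_idx) -> answer string; assumed to re-run
        the same Yes/No query with attention steering enabled only at that
        single layer (hypothetical harness).
    Returns the layers where single-layer steering flips the baseline answer,
    i.e. candidates for where the Yes/No decision is actually formed.
    """
    return [layer for layer in layers
            if run_with_steering(layer) != baseline_answer]
```

If no single layer in 12–28 flips the answer but some earlier layer does, that would suggest the decision circuitry sits upstream of where I'm currently intervening.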

Any advice on debugging this “Grounding vs. Faithfulness” gap would be appreciated!