Hello, I have encountered an issue regarding the implementation of an operator.
Specifically, I pass in two hidden_states tensors: a with shape (1, 10, 3584) and b with shape (1, 41, 3584).
The first 10 positions of b (along the sequence dimension) are exactly identical to a, and both tensors have dtype=torch.bfloat16.
Given that self.q_proj is the same Linear(in_features=3584, out_features=3584, bias=True) module in both cases, I would expect that the first 10 elements of query_states_b would match exactly with those of query_states_a.
However, I observed that there are slight differences between them. Could you please explain why this happens?
Additionally, I noticed that when the shape of a is (1, 35, 3584), this issue does not occur.
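For reference, here is a rough, self-contained sketch of what I mean (random data in place of the real hidden_states, and q_proj built directly rather than taken from the model):

import torch

q_proj = torch.nn.Linear(3584, 3584, bias=True).eval().to(torch.bfloat16)

a = torch.rand(1, 10, 3584, dtype=torch.bfloat16)
b = torch.rand(1, 41, 3584, dtype=torch.bfloat16)
b[:, :10, :] = a  # the first 10 positions of b are identical to a

with torch.inference_mode():
    query_states_a = q_proj(a)
    query_states_b = q_proj(b)

# I would expect this to be exactly zero, but on my setup it is not:
print((query_states_a - query_states_b[:, :10, :]).abs().max())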
You didn’t mention how large the differences are, so I assume the error is just the expected noise caused by the limited floating point precision and a different order of operations in different kernels.
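As a toy illustration of that last point (plain Python floats, nothing model-specific): floating point addition is not associative, so kernels that accumulate partial sums in a different order can legitimately differ in the last bits.

print((0.1 + 0.2) + 0.3)  # 0.6000000000000001
print(0.1 + (0.2 + 0.3))  # 0.6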
Thank you very much for responding to my question!
Let me try to describe the phenomenon more clearly.
I tested two input sequences: one of length 35 and the other of length 41. The first 35 tokens of both inputs are identical.
Then I ran them through the LLM and extracted the attention_output from the first decoder layer (layer 0):

attn_output_10.shape  # torch.Size([1, 35, 3584])
attn_output.shape     # torch.Size([1, 41, 3584])
attn_output.dtype     # torch.bfloat16

Both outputs have dtype torch.bfloat16, so their numerical resolution is only around 0.01.
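(For reference, the relative step size PyTorch reports for that dtype:)

import torch
print(torch.finfo(torch.bfloat16).eps)  # 0.0078125, i.e. roughly 1e-2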
After saving and comparing the results, I noticed that the outputs are exactly the same for the first 15 positions.
However, starting from position 16, discrepancies begin to appear, and this is exactly what I don’t understand. The issue persists, and I don’t know why.
As far as I can tell, this occurs already at the very first decoder layer, and such discrepancies accumulate with deeper layers, eventually resulting in significant hidden-feature differences at each input position.
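(In case it matters, this is roughly how I compare the two saved outputs; the file names here are just placeholders:)

import torch

# the two saved attention outputs, shapes (1, 35, 3584) and (1, 41, 3584)
attn_output_35 = torch.load("attn_output_len35.pt")
attn_output_41 = torch.load("attn_output_len41.pt")

diff = (attn_output_35 - attn_output_41[:, :35, :]).abs()
per_position = diff.amax(dim=-1).squeeze(0)  # max abs difference per position
print(per_position)  # zero for the first 15 positions, non-zero from position 16 on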
Summary: I’d really appreciate it if you could let me know how to avoid this randomness.
PS: The reason I’m paying attention to those early hidden features — even the ones that don’t seem to contribute to the final generation — is because I’m using speculative decoding. But that’s a whole other story, haha.
Using deterministic algorithms won’t help, since that only enforces reproducible results for the same sequence length, not the same kernel across different shapes.
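For instance, something along these lines (just a sketch on CPU with default settings) only pins down reproducibility for identical calls, not agreement between the two shapes:

import torch

torch.use_deterministic_algorithms(True)

lin = torch.nn.Linear(3584, 3584).eval().to(torch.bfloat16)
x = torch.rand(41, 3584, dtype=torch.bfloat16)

with torch.inference_mode():
    a1 = lin(x[:35])  # same shape, same call:
    a2 = lin(x[:35])  # deterministic mode guarantees these match bit for bit
    b = lin(x)[:35]   # different shape, possibly a different kernel / summation order

print(torch.equal(a1, a2))  # True
print(torch.equal(a1, b))   # not guaranteed to be True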
Since you are seeing differences starting from a specific sequence length, another algorithm seems to be selected there, and you could check the relative errors for bfloat16 as well as other dtypes.
import torch

_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
_DTYPE = torch.bfloat16

dim = 1280  # lower dimensions like 128 result in 0 diff.
qkv_proj = torch.nn.Linear(dim, 3 * dim).eval().to(_DEVICE, _DTYPE)
x = torch.rand(128, dim).to(_DEVICE, _DTYPE)
with torch.inference_mode():
    diff = qkv_proj(x[:54, :]) - qkv_proj(x)[:54, :]
    print(diff.abs().max())  # Output: tensor(0.0020, dtype=torch.bfloat16)
    print(diff.nonzero())
Seems that smaller dimensions don’t have this problem. Higher precision also results in lower error. Even with _DEVICE="cpu", we still get this behaviour. Is there a way to force the computation to be deterministic?
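For what it’s worth, this is roughly how I checked the dtype dependence (same repro as above, just sweeping _DTYPE on CPU):

import torch

_DEVICE = "cpu"
dim = 1280

for _DTYPE in (torch.bfloat16, torch.float32, torch.float64):
    torch.manual_seed(0)
    qkv_proj = torch.nn.Linear(dim, 3 * dim).eval().to(_DEVICE, _DTYPE)
    x = torch.rand(128, dim).to(_DEVICE, _DTYPE)
    with torch.inference_mode():
        diff = qkv_proj(x[:54, :]) - qkv_proj(x)[:54, :]
    print(_DTYPE, diff.abs().max().item())  # error typically shrinks as precision increases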