query_states = self.q_proj(hidden_states)
Hello, I have encountered an issue with the behavior of an operator. Specifically, I pass two hidden_states tensors through the same projection: a with shape (1, 10, 3584) and b with shape (1, 41, 3584). Note that the first 10 positions of b are exactly identical to a, and both tensors have dtype=torch.bfloat16.
Given that self.q_proj is the same Linear(in_features=3584, out_features=3584, bias=True) module in both cases, I would expect the first 10 positions of query_states_b to match those of query_states_a exactly. However, I observe slight differences between them. Could you please explain why this happens?
Additionally, I noticed that when a has shape (1, 35, 3584), this issue does not occur.
PyTorch version: 2.1.2+cu121
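For context, what I am doing is essentially equivalent to this minimal sketch (random weights stand in for the actual q_proj weights here, and a CUDA device is assumed):

import torch

torch.manual_seed(0)

# random weights standing in for the model's actual q_proj
q_proj = torch.nn.Linear(3584, 3584, bias=True).to(device="cuda", dtype=torch.bfloat16)

a = torch.randn(1, 10, 3584, device="cuda", dtype=torch.bfloat16)
b = torch.randn(1, 41, 3584, device="cuda", dtype=torch.bfloat16)
b[:, :10] = a  # the first 10 positions of b are identical to a

query_states_a = q_proj(a)
query_states_b = q_proj(b)

# I would expect this maximum difference to be exactly 0, but it is not
print((query_states_b[:, :10] - query_states_a).abs().max())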
You didn’t mention how large the differences are, so I assume the error represents the expected noise caused by the limited floating-point precision and a different order of operations in different kernels.
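As a toy illustration of how the order of operations alone changes low-precision results, floating-point addition is not associative, e.g. in bfloat16:

import torch

a = torch.tensor(8192.0, dtype=torch.bfloat16)
b = torch.tensor(-8192.0, dtype=torch.bfloat16)
c = torch.tensor(0.5, dtype=torch.bfloat16)

print((a + b) + c)  # 0.5
print(a + (b + c))  # 0.0 - c is absorbed by the large intermediate value

The GEMM kernels selected for different problem sizes split and accumulate the reduction dimension differently, which is exactly this effect at scale.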
Thank you very much for responding to my question!
Let me try to describe the phenomenon more clearly.
I tested two input sequences — one of length 35 and the other of length 41. The first 35 tokens of both inputs are identical.
attn_output_10.shape torch.Size([1, 35, 3584])
attn_output.shape torch.Size([1, 41, 3584])
Then I ran both inputs through an LLM and extracted the attention output (attn_output) of the first decoder layer (layer 0).
Both outputs have the data type torch.bfloat16, whose relative resolution is quite coarse (machine epsilon is 2**-7 ≈ 0.0078, i.e. roughly 0.01).
attn_output.dtype torch.bfloat16
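(torch.finfo confirms this coarse resolution:)

import torch
print(torch.finfo(torch.bfloat16).eps)  # 0.0078125, i.e. only about 2-3 significant decimal digits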
After saving and comparing the results, I noticed that the outputs are exactly the same for the first 15 positions.

However, starting from position 16, discrepancies begin to appear, and this is exactly what I don’t understand.
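The comparison itself is essentially this (the file names are just placeholders for however the two saved layer-0 outputs are stored):

import torch

# placeholder file names; the tensors were saved right after the layer-0 attention
attn_output_10 = torch.load("attn_output_len35.pt")  # torch.Size([1, 35, 3584])
attn_output = torch.load("attn_output_len41.pt")     # torch.Size([1, 41, 3584])

# maximum absolute difference per position over the 35 shared positions
diff = (attn_output[:, :35].float() - attn_output_10.float()).abs().amax(dim=-1)
print(diff)  # zero for the first 15 positions, non-zero from position 16 onward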
To control for such random inconsistencies or noise, I tried to enforce deterministic behavior at the beginning of the inference script:
import os
# must be set before the first cuBLAS call so deterministic cuBLAS behavior is possible
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import random
import numpy as np
import torch

seed = 2025
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# force deterministic kernels and disable autotuning / reduced-precision shortcuts
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.allow_tf32 = False
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False
torch.set_float32_matmul_precision("highest")
Still, this issue persists, and I don’t know why.
As far as I can tell, this starts at the very first decoder layer, and the discrepancies accumulate through deeper layers, eventually resulting in significant hidden-feature differences at each input position.
Summary: I’d really appreciate it if you could let me know how to avoid this randomness.
PS: The reason I’m paying attention to those early hidden features, even the ones that don’t seem to contribute to the final generation, is that I’m using speculative decoding. But that’s a whole other story, haha.
Using deterministic algorithms won’t help here, since it guarantees deterministic results only for the same sequence length (i.e. the same kernel), not across all kernels.
Since you are seeing differences starting from a specific sequence length, another algorithm seems to be selected there, and you could check the relative errors for bfloat16 as well as for other dtypes.
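A rough sketch of such a check, using a standalone Linear layer as a stand-in for the actual projection and a float64 run as the reference (the helper below is just an illustration, not part of any library):

import torch

def median_rel_error(dtype, seq_len=41, hidden=3584, device="cuda"):
    # hypothetical helper: compare a Linear forward in `dtype` against a float64 reference
    torch.manual_seed(0)
    linear = torch.nn.Linear(hidden, hidden, bias=True, device=device)
    x = torch.randn(1, seq_len, hidden, device=device)

    ref = linear.double()(x.double())              # float64 reference
    out = linear.to(dtype)(x.to(dtype)).double()   # same weights, lower precision
    return ((out - ref).abs() / ref.abs().clamp_min(1e-12)).median().item()

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    print(dtype, median_rel_error(dtype))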