Inconsistent Output for Identical Inputs When Using Linear Projection with Different Sequence Lengths

query_states = self.q_proj(hidden_states)

Hello, I have encountered an issue regarding the implementation of an operator.
Specifically, I pass two hidden_states tensors through it: a with shape (1, 10, 3584) and b with shape (1, 41, 3584).
It is important to note that the first 10 positions of b are exactly identical to a, and both tensors have dtype=torch.bfloat16.
Given that self.q_proj is the same Linear(in_features=3584, out_features=3584, bias=True) module in both cases, I would expect the first 10 positions of query_states_b to match those of query_states_a exactly.
However, I observed that there are slight differences between them. Could you please explain why this happens?
Additionally, I noticed that when the shape of a is (1, 35, 3584), this issue does not occur.

PyTorch version: 2.1.2+cu121
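For reference, what I am doing is roughly equivalent to the following sketch (the Linear layer here is a freshly initialized stand-in, not the model's actual q_proj weights):

import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for self.q_proj: a randomly initialized layer, not the checkpoint's weights.
q_proj = nn.Linear(3584, 3584, bias=True, device="cuda", dtype=torch.bfloat16)

b = torch.randn(1, 41, 3584, device="cuda", dtype=torch.bfloat16)
a = b[:, :10, :].contiguous()   # the first 10 positions of b, identical values

query_states_a = q_proj(a)
query_states_b = q_proj(b)

# I would expect this to print 0, but with the real model's q_proj I see small differences.
print((query_states_b[:, :10, :] - query_states_a).abs().max())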

You didn’t mention how large the differences are, so I assume the error represents the expected noise caused by limited floating point precision and a different order of operations in the kernels selected for different shapes.
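Just to illustrate the general effect (this is not your exact kernels): accumulating the same bfloat16 values with a different rounding/accumulation strategy already gives slightly different results:

import torch

torch.manual_seed(0)
x = torch.randn(4096, dtype=torch.bfloat16)

# Strategy 1: accumulate step by step in bfloat16 (rounding after every add).
acc_bf16 = torch.zeros((), dtype=torch.bfloat16)
for v in x:
    acc_bf16 = acc_bf16 + v

# Strategy 2: accumulate in float32 and round once at the end
# (roughly what a GEMM kernel with fp32 accumulators does).
acc_fp32 = x.float().sum().to(torch.bfloat16)

print(acc_bf16.item(), acc_fp32.item())   # the two results generally differ slightly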

Thank you very much for responding to my question!
Let me try to describe the phenomenon more clearly.

I tested two input sequences — one of length 35 and the other of length 41. The first 35 tokens of both inputs are identical.
attn_output_10.shape torch.Size([1, 35, 3584])
attn_output.shape torch.Size([1, 41, 3584])
Then I ran both inputs through an LLM and extracted the attention output of the first decoder layer (layer 0).
Both outputs have dtype torch.bfloat16, so their relative resolution is only about 2-3 significant decimal digits (machine epsilon ≈ 2^-8 ≈ 0.004).
attn_output.dtype torch.bfloat16
After saving and comparing the results, I noticed that the outputs are exactly the same for the first 15 positions.
However, starting from position 16, discrepancies begin to appear — and this is exactly what I don’t understand.
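The comparison I ran is essentially this (the file names below are just placeholders for the tensors I saved during the two forward passes):

import torch

attn_output_35 = torch.load("attn_output_len35.pt")   # shape (1, 35, 3584), placeholder name
attn_output_41 = torch.load("attn_output_len41.pt")   # shape (1, 41, 3584), placeholder name

# Per-position maximum absolute difference over the 35 shared tokens.
diff = (attn_output_41[:, :35, :].float() - attn_output_35.float()).abs().amax(dim=-1).squeeze(0)
for pos, d in enumerate(diff.tolist(), start=1):
    print(f"position {pos}: max abs diff = {d}")
# Positions 1-15 print 0.0; from position 16 onward the differences are non-zero.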


To rule out such random inconsistencies or noise, I tried to enforce deterministic behavior at the beginning of the inference script:

import os
# Must be set before the first cuBLAS call for torch.use_deterministic_algorithms to take effect.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import random
import numpy as np
import torch

seed = 2025
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# Force deterministic kernel selection and disable fast, reduced-precision paths.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.allow_tf32 = False
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False
torch.set_float32_matmul_precision("highest")

Still, the issue persists, and I don’t know why.
As far as I can tell, the discrepancy appears at the very first decoder layer and accumulates through deeper layers, eventually producing significant hidden-feature differences at every input position.

Summary: I’d really appreciate it if you could let me know how to avoid this randomness.

PS: The reason I’m paying attention to those early hidden features — even the ones that don’t seem to contribute to the final generation — is because I’m using speculative decoding. But that’s a whole other story, haha. :upside_down_face:

Using deterministic algorithms won’t help here: it guarantees reproducible results across runs for the same input shape, not that the kernels selected for different shapes produce bitwise-identical outputs.
Since you are seeing differences starting from a specific sequence length, a different algorithm is most likely being dispatched for the longer input, and you could check the relative errors for bfloat16 as well as other dtypes to confirm the mismatch is within the expected rounding noise.
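Something along these lines would show whether the mismatch is in the range of the expected rounding noise for each dtype (float64 serves as the reference; the layer here is a randomly initialized stand-in, not your model):

import torch

def max_abs_error(dtype, seq_len=41, hidden=3584, device="cuda"):
    torch.manual_seed(0)
    linear = torch.nn.Linear(hidden, hidden, bias=True, device=device, dtype=torch.float64)
    x = torch.randn(1, seq_len, hidden, device=device, dtype=torch.float64)
    ref = linear(x)                                         # float64 reference
    out = linear.to(dtype)(x.to(dtype)).to(torch.float64)   # lower-precision result
    return (out - ref).abs().max().item()

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    print(dtype, max_abs_error(dtype))
# bfloat16 will show by far the largest error; if the mismatches you see are of
# the same order, they are expected numerical noise rather than a bug.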