Hello, I have encountered an issue regarding the implementation of an operator.
Specifically, I pass in two hidden_states tensors: a with shape (1, 10, 3584) and b with shape (1, 41, 3584).
The first 10 positions of b (along the sequence dimension) are exactly identical to a, and both tensors have dtype=torch.bfloat16.
Given that self.q_proj is the same Linear(in_features=3584, out_features=3584, bias=True) module in both cases, I would expect that the first 10 elements of query_states_b would match exactly with those of query_states_a.
However, I observed that there are slight differences between them. Could you please explain why this happens?
Additionally, I noticed that when the shape of a is (1, 35, 3584), this issue does not occur.
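For reference, here is a rough, self-contained sketch of what I mean (random data in place of the real hidden_states, and q_proj built directly rather than taken from the model):

import torch

q_proj = torch.nn.Linear(3584, 3584, bias=True).eval().to(torch.bfloat16)

a = torch.rand(1, 10, 3584, dtype=torch.bfloat16)
b = torch.rand(1, 41, 3584, dtype=torch.bfloat16)
b[:, :10, :] = a  # the first 10 positions of b are identical to a

with torch.inference_mode():
    query_states_a = q_proj(a)
    query_states_b = q_proj(b)

# I would expect this to be exactly zero, but on my setup it is not:
print((query_states_a - query_states_b[:, :10, :]).abs().max())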
You didn’t mention how large the differences are, so I assume the error is just the expected noise caused by the limited floating point precision and a different order of operations in different kernels.
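As a toy illustration of that last point (plain Python floats, nothing model-specific): floating point addition is not associative, so kernels that accumulate partial sums in a different order can legitimately differ in the last bits.

print((0.1 + 0.2) + 0.3)  # 0.6000000000000001
print(0.1 + (0.2 + 0.3))  # 0.6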
Thank you very much for responding to my question!
Let me try to describe the phenomenon more clearly.
I tested two input sequences: one of length 35 and the other of length 41. The first 35 tokens of both inputs are identical.
Then I ran them through the LLM and extracted the attention_output from the first decoder layer (layer 0):

attn_output_10.shape  # torch.Size([1, 35, 3584])
attn_output.shape     # torch.Size([1, 41, 3584])
attn_output.dtype     # torch.bfloat16

Both outputs have dtype torch.bfloat16, so their numerical resolution is only around 0.01.
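(For reference, the relative step size PyTorch reports for that dtype:)

import torch
print(torch.finfo(torch.bfloat16).eps)  # 0.0078125, i.e. roughly 1e-2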
After saving and comparing the results, I noticed that the outputs are exactly the same for the first 15 positions.
However, starting from position 16, discrepancies begin to appear, and this is exactly what I don’t understand. The issue persists, and I don’t know why.
As far as I can tell, this occurs already at the very first decoder layer, and such discrepancies accumulate with deeper layers, eventually resulting in significant hidden-feature differences at each input position.
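(In case it matters, this is roughly how I compare the two saved outputs; the file names here are just placeholders:)

import torch

# the two saved attention outputs, shapes (1, 35, 3584) and (1, 41, 3584)
attn_output_35 = torch.load("attn_output_len35.pt")
attn_output_41 = torch.load("attn_output_len41.pt")

diff = (attn_output_35 - attn_output_41[:, :35, :]).abs()
per_position = diff.amax(dim=-1).squeeze(0)  # max abs difference per position
print(per_position)  # zero for the first 15 positions, non-zero from position 16 on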
Summary: I’d really appreciate it if you could let me know how to avoid this randomness.
PS: The reason I’m paying attention to those early hidden features — even the ones that don’t seem to contribute to the final generation — is because I’m using speculative decoding. But that’s a whole other story, haha.
Using deterministic algorithms won’t help, since that only enforces reproducible results for the same sequence length, not the same kernel across different shapes.
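For instance, something along these lines (just a sketch on CPU with default settings) only pins down reproducibility for identical calls, not agreement between the two shapes:

import torch

torch.use_deterministic_algorithms(True)

lin = torch.nn.Linear(3584, 3584).eval().to(torch.bfloat16)
x = torch.rand(41, 3584, dtype=torch.bfloat16)

with torch.inference_mode():
    a1 = lin(x[:35])  # same shape, same call:
    a2 = lin(x[:35])  # deterministic mode guarantees these match bit for bit
    b = lin(x)[:35]   # different shape, possibly a different kernel / summation order

print(torch.equal(a1, a2))  # True
print(torch.equal(a1, b))   # not guaranteed to be True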
Since you are seeing differences starting from a specific sequence length, another algorithm seems to be selected there, and you could check the relative errors for bfloat16 as well as other dtypes.
import torch

_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
_DTYPE = torch.bfloat16

dim = 1280  # lower dimensions like 128 result in 0 diff.
qkv_proj = torch.nn.Linear(dim, 3 * dim).eval().to(_DEVICE, _DTYPE)
x = torch.rand(128, dim).to(_DEVICE, _DTYPE)
with torch.inference_mode():
    diff = qkv_proj(x[:54, :]) - qkv_proj(x)[:54, :]
    print(diff.abs().max())  # Output: tensor(0.0020, dtype=torch.bfloat16)
    print(diff.nonzero())
Seems that smaller dimensions don’t have this problem. Higher precision also results in lower error. Even with _DEVICE="cpu", we still get this behaviour. Is there a way to force the computation to be deterministic?
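For what it’s worth, this is roughly how I checked the dtype dependence (same repro as above, just sweeping _DTYPE on CPU):

import torch

_DEVICE = "cpu"
dim = 1280

for _DTYPE in (torch.bfloat16, torch.float32, torch.float64):
    torch.manual_seed(0)
    qkv_proj = torch.nn.Linear(dim, 3 * dim).eval().to(_DEVICE, _DTYPE)
    x = torch.rand(128, dim).to(_DEVICE, _DTYPE)
    with torch.inference_mode():
        diff = qkv_proj(x[:54, :]) - qkv_proj(x)[:54, :]
    print(_DTYPE, diff.abs().max().item())  # error typically shrinks as precision increases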