Tensor parallel numeric mismatch

Hi, I am applying tensor parallelism to a submodule:

self.ffn = nn.Sequential(
    nn.Linear(dim, ffn_dim),
    nn.GELU(approximate="tanh"),
    nn.Linear(ffn_dim, dim),
)

The parallel plan is:

"ffn": PrepareModuleInput(
    input_layouts=(Replicate(),),
    desired_input_layouts=(Replicate(),),
),
"ffn.0": ColwiseParallel(),
"ffn.2": RowwiseParallel(
    output_layouts=Replicate(),
    use_local_output=True,
),

But after parallelizing, the results do not match numerically. For example, the output norm is 73.18 vs. 73.15. Is this expected, or is there something wrong? Thanks.

What precisions are you using with and without TP? Different dtypes in communication or computation could cause different numerics.
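
To illustrate why this matters: with RowwiseParallel, each rank reduces over its own slice of the input dimension and the partial results are summed afterwards, so the additions happen in a different order than in the single-device matmul. Floating-point addition is not associative, so the two results can differ slightly even when everything is correct. Below is a minimal pure-Python sketch of that effect, not the actual PyTorch code path. It emulates float32 by rounding through `struct`; the helper names (`to_f32`, `dot_f32`, `sharded_dot_f32`), the size `n`, and the random data are all made up for illustration.

```python
import math
import random
import struct


def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_f32(a, b):
    """Dot product with every multiply and add rounded to float32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc


def sharded_dot_f32(a, b, world_size):
    """Split the reduction dimension across ranks, then sum the partials.

    This mimics the row-wise pattern: each rank reduces its own slice and
    the partial results are added afterwards (the all-reduce step).
    """
    shard = len(a) // world_size
    partials = [
        dot_f32(a[r * shard:(r + 1) * shard], b[r * shard:(r + 1) * shard])
        for r in range(world_size)
    ]
    total = 0.0
    for p in partials:
        total = to_f32(total + p)
    return total


rng = random.Random(0)
n = 1024  # hypothetical reduction size, divisible by the world sizes below
a = [to_f32(rng.random()) for _ in range(n)]
b = [to_f32(rng.random()) for _ in range(n)]

exact = math.fsum(x * y for x, y in zip(a, b))  # float64 reference
full = dot_f32(a, b)
print(f"reference (float64): {exact!r}")
print(f"single device:       {full!r}")
for world_size in (2, 4, 8):
    sharded = sharded_dot_f32(a, b, world_size)
    print(f"sharded over {world_size}:      {sharded!r}")
```

Any gap between the printed values comes purely from the order of the additions, and in float32 it should be tiny relative to the result. For scale, the norms reported above differ by roughly 4e-4 in relative terms, which is much larger than float32 rounding (around 1e-7 per operation) but would not be surprising if part of the computation or communication runs in bfloat16 or float16. That is an inference from the numbers alone, so checking the dtypes on both paths is the natural first step.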