Tensor parallel numeric mismatch

What precisions do you use with and without TP? Using different dtypes for communication and computation can produce different numerics.
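To make that concrete, here is a rough, pure-Python sketch of the effect (not your actual setup). It simulates a row-parallel dot product where each "rank" computes a partial sum and the partial sums are then combined, once at full precision and once with the combine step rounded to fp16. The names `to_fp16` and `sharded_dot` are hypothetical helpers for illustration only, and the loop is a stand-in for an all-reduce, not a real collective.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot(a, b):
    """Plain dot product in Python floats (double precision)."""
    return sum(x * y for x, y in zip(a, b))


def sharded_dot(a, b, shards, reduce_in_fp16):
    """Simulated row-parallel dot product.

    Each shard computes a partial sum over its slice; the partial sums
    are then accumulated, which stands in for the all-reduce. When
    reduce_in_fp16 is True, both the partial sums and the running total
    are rounded to fp16, mimicking a low-precision communication dtype.
    """
    n = len(a) // shards
    total = 0.0
    for r in range(shards):
        partial = dot(a[r * n:(r + 1) * n], b[r * n:(r + 1) * n])
        if reduce_in_fp16:
            partial = to_fp16(partial)
            total = to_fp16(total + partial)
        else:
            total += partial
    return total


# Toy inputs (assumed for illustration only).
a = [0.1] * 8
b = [1.0] * 8

reference = dot(a, b)                                  # no sharding
tp_full = sharded_dot(a, b, shards=4, reduce_in_fp16=False)
tp_fp16 = sharded_dot(a, b, shards=4, reduce_in_fp16=True)

print("reference        :", reference)
print("TP, full reduce  :", tp_full)
print("TP, fp16 reduce  :", tp_fp16)
print("abs diff (fp16)  :", abs(tp_fp16 - reference))
```

If the mismatch you see is of a similar order to what a lower-precision reduce would introduce, the communication dtype is a likely suspect. Even with matching dtypes, sharding changes the summation order, so tiny differences at the level of floating-point rounding are expected.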