PyTorch Tensor Parallel

Is the torch DTensor-based tensor parallel implementation materially different from Megatron-LM's TP? In Megatron it's recommended to set CUDA_DEVICE_MAX_CONNECTIONS=1 to enable TP communication overlap. Does the same apply to torch DTensor TP? I got an answer on a torchtitan issue, but I'd like more context on the differences between torch TP and Megatron TP. Thanks!
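
For reference, here's a minimal sketch of the kind of DTensor TP setup I mean, using the `torch.distributed.tensor.parallel` API (the `MLP` module, layer names, and sizes are just placeholders for illustration, not my actual model):

```python
# Minimal DTensor TP sketch (PyTorch >= 2.3), launched with torchrun.
# The question is whether this path also wants CUDA_DEVICE_MAX_CONNECTIONS=1
# for compute/communication overlap, as Megatron-LM recommends.
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    parallelize_module,
    ColwiseParallel,
    RowwiseParallel,
)


class MLP(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.up_proj = nn.Linear(dim, 4 * dim)
        self.down_proj = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.down_proj(torch.relu(self.up_proj(x)))


# One TP group spanning the local GPUs (assumes torchrun set up the process group env).
tp_mesh = init_device_mesh("cuda", (torch.cuda.device_count(),))

model = MLP().cuda()

# Megatron-style column-then-row split: shard up_proj column-wise and
# down_proj row-wise so only one all-reduce is needed per MLP forward.
model = parallelize_module(
    model,
    tp_mesh,
    {
        "up_proj": ColwiseParallel(),
        "down_proj": RowwiseParallel(),
    },
)
```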