I am running the fsdp_tp_example from the PyTorch examples repository.

I understood that tensor parallelism distributes weights and activations across devices. For example, if we apply `ColwiseParallel` to a Linear layer, I expected each device to hold only its shard of the weight tensor, e.g. something like `weight[:out_features // tp_size, :]` on rank 0. However, after running the fsdp_tp_example, I observed that the entire set of model parameters stays on every GPU. For instance, with the LLaMA 7B model, the full 24 GB of parameters remain on each GPU even after calling `parallelize_module`.
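To make my expectation concrete, here is a minimal, dependency-free sketch (plain Python, no PyTorch; `tp_size`, `matmul`, and the data are illustrative, not actual APIs) of the column-wise sharding I thought `ColwiseParallel` would perform: each rank stores only its slice of the weight, computes a partial output, and an all-gather along the feature dimension recovers the full result.

```python
def matmul(x, w):
    """Compute x @ w.T for nested lists, where w has shape (out, in)."""
    return [[sum(xi * wi for xi, wi in zip(row, w_row)) for w_row in w]
            for row in x]

# Full Linear weight: 4 output features, 3 input features.
weight = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9],
          [10, 11, 12]]
x = [[1.0, 0.5, -1.0]]  # one input row

tp_size = 2
shard = len(weight) // tp_size  # 2 output rows per rank

# Each rank keeps only its slice: weight[r * shard : (r + 1) * shard, :]
rank_weights = [weight[r * shard:(r + 1) * shard] for r in range(tp_size)]

# Each rank computes a partial output over its own output features...
partials = [matmul(x, w_r) for w_r in rank_weights]

# ...and concatenating along the feature dim (an "all-gather")
# recovers the full output of the unsharded layer.
gathered = [sum((p[i] for p in partials), []) for i in range(len(x))]

assert gathered == matmul(x, weight)  # matches the unsharded computation
```

Under this scheme, each rank would only ever materialize `1/tp_size` of the weight, which is why the full-parameters-per-GPU behavior I observed surprised me.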

Is this the expected behavior? I thought the main purpose of tensor parallelism was to fit large models across multiple GPUs by splitting the weights, but it looks like the full weights reside on every device and only sub-parts of them are used in the computation.