Topic | Replies | Views | Activity
--- | --- | --- | ---
About the distributed category | 1 | 2676 | January 20, 2021
Adding/removing new trainers on the single node to elastic training | 0 | 4 | February 21, 2025
Segfault during torch.save | 2 | 14 | February 21, 2025
Two models in distributed data parallel | 0 | 7 | February 20, 2025
Handling signals in distributed train loop | 0 | 6 | February 20, 2025
[Distributed w/ TorchTitan] Breaking Barriers: Training Long Context LLMs with 1M Sequence Length in PyTorch Using Context Parallel | 1 | 3263 | February 20, 2025
Does FSDP shard frozen layers? | 0 | 5 | February 19, 2025
CPU memory in FSDP lora merging | 0 | 13 | February 14, 2025
Memory leak when using RPC for pipeline parallelism | 17 | 2511 | February 13, 2025
Too much GPU memory usage for input/model size? | 0 | 17 | February 13, 2025
PyTorch Not Automatically Utilizing Multiple GPUs | 4 | 48 | February 13, 2025
Cannot find NCCL libnccl-net.so file | 5 | 3639 | February 11, 2025
Code is getting struck due to async between the gpus in distributed setup | 1 | 20 | February 11, 2025
Finding the cause of RuntimeError: Expected to mark a variable ready only once | 27 | 22998 | February 10, 2025
Activation Checkpointing wrapper failed with torch.jit.save | 3 | 43 | February 10, 2025
CLIP Model Batching When Batch Size Limited | 1 | 167 | February 10, 2025
How to save checkpoint when using FSDP2? | 0 | 29 | February 9, 2025
DDP training hangs on one rank during backward on H100s | 3 | 190 | February 8, 2025
FP8 training with torchao but without torchtitan | 2 | 140 | February 7, 2025
SLURM srun vs torchrun: Difference numbers of spawn processes | 2 | 908 | February 7, 2025
NCCL Timeout only on H100s, not other hardware | 3 | 95 | February 6, 2025
DDP with SLURM hangs | 1 | 347 | February 6, 2025
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select) while using Dataparallel class | 11 | 7750 | February 6, 2025
FSDP for multi-gpu encounters ValueError: Inconsistent compute device and `device_id` on rank 1: cuda:0 vs cuda:1 | 0 | 57 | February 5, 2025
How to get data on the same device as fsdp model during training? | 0 | 20 | February 4, 2025
FSDP clarifying questions | 0 | 55 | February 3, 2025
Torch.multiprocessing: pass metadata or class wrapper for shared memory CUDA tensor? | 0 | 8 | January 31, 2025
Distributed training got stuck every few seconds | 14 | 3711 | January 29, 2025
Wrapping a DDP module inside a simple Module | 0 | 10 | January 29, 2025
InfiniBand Vs TCP | 1 | 88 | January 29, 2025